Abstract In this chapter we discuss conceptually high dimensional sparse econometric
models as well as estimation of these models using $\ell_1$-penalization and
post-$\ell_1$-penalization methods. Focusing on linear and nonparametric regression
frameworks, we discuss various econometric examples, present basic theoretical results,
and illustrate the concepts and methods with Monte Carlo simulations and an empirical
application. In the application, we examine and confirm the empirical validity
of the Solow-Swan model for international economic growth.



1 The High Dimensional Sparse Econometric Model 

We consider linear, high dimensional sparse (HDS) regression models in econometrics.
The HDS regression model has a large number of regressors $p$, possibly much
larger than the sample size $n$, but only a relatively small number $s < n$ of these
regressors are important for capturing accurately the main features of the regression
function. The latter assumption makes it possible to estimate these models effectively
by searching for approximately the right set of regressors, using $\ell_1$-based
penalization methods. In this chapter we will review the basic theoretical properties
of these procedures, established in the works of [8], [10], [18], [17], [7], [15], [13], [27], [26],
among others (see [20, 7] for a detailed literature review). In this section, we review
the modeling foundations as well as motivating examples for these procedures, with
emphasis on applications in econometrics.

Let us first consider an exact or parametric HDS regression model, namely, 



Alexandre Belloni
Duke University, Fuqua School of Business, 100 Fuqua Drive, Durham, NC, e-mail:
abn5@duke.edu

Victor Chernozhukov
Massachusetts Institute of Technology, Department of Economics, 50 Memorial Drive, Cambridge,
MA, e-mail: vchern@mit.edu






$$y_i = x_i'\beta_0 + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad \beta_0 \in \mathbb{R}^p, \quad i = 1, \ldots, n, \tag{1}$$

where $y_i$'s are observations of the response variable, $x_i$'s are observations of
$p$-dimensional fixed regressors, and $\epsilon_i$'s are i.i.d. normal disturbances, where possibly
$p \gg n$. The key assumption of the exact model is that the true parameter value $\beta_0$ is
sparse, having only $s < n$ non-zero components with support denoted by

$$T = \mathrm{support}(\beta_0) \subset \{1, \ldots, p\}. \tag{2}$$

Next let us consider an approximate or nonparametric HDS model. To this end, let
us introduce the regression model

$$y_i = f(z_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n, \tag{3}$$

where $y_i$ is the outcome, $z_i$ is a vector of elementary fixed regressors, $z \mapsto f(z)$ is the
true, possibly non-linear, regression function, and $\epsilon_i$'s are i.i.d. normal disturbances.
We can convert this model into an approximate HDS model by writing

$$y_i = x_i'\beta_0 + r_i + \epsilon_i, \quad i = 1, \ldots, n, \tag{4}$$

where $x_i = P(z_i)$ is a $p$-dimensional regressor formed from the elementary regressors
by applying, for example, polynomial or spline transformations, $\beta$ is a conformable
parameter vector, whose "true" value $\beta_0$ has only $s < n$ non-zero components with
support denoted as in (2), and $r_i := r(z_i) = f(z_i) - x_i'\beta_0$ is the approximation error.
We shall define the true value $\beta_0$ more precisely in the next section. For now, it is
important to note only that we assume there exists a value $\beta_0$ having only $s$ non-zero
components that sets the approximation error $r_i$ to be small.

Before considering estimation, a natural question is whether exact or approxi- 
mate HDS models make sense in econometric applications. In order to answer this 
question it is helpful to consider the following example, in which we abstract from 
estimation completely and only ask whether it is possible to accurately describe 
some structural econometric function $f(z)$ using a low-dimensional approximation
of the form $P(z)'\beta_0$. In particular, we are interested in improving upon the
conventional low-dimensional approximations.

Example 1: Sparse Models for Earning Regressions. In this example we consider
a model for the conditional expectation of log-wage $y_i$ given education $z_i$,
measured in years of schooling. Since measured education takes on a finite number
of years, we can expand the conditional expectation of wage $y_i$ given education $z_i$:

$$E[y_i \mid z_i] = \sum_{j=1}^{p} \beta_{0j} P_j(z_i), \tag{5}$$

using some dictionary of approximating functions $P_1(z_i), \ldots, P_p(z_i)$, such as polynomial
or spline transformations in $z_i$ and/or indicator variables for levels of $z_i$. In fact,
since we can consider an overcomplete dictionary, the representation of the function
may not be unique, but this is not important for our purposes.
A conventional sparse approximation employed in econometrics is, for example,

$$f(z_i) := E[y_i \mid z_i] = \beta_1 P_1(z_i) + \cdots + \beta_s P_s(z_i) + r_i, \tag{6}$$

where the $P_j$'s are low-order polynomials or splines, with typically $s = 4$ or $5$ terms,
but there is no guarantee that the approximation error $r_i$ in this case is small, or that
these particular polynomials form the best possible $s$-dimensional approximation.
Indeed, we might expect the function $E[y_i \mid z_i]$ to exhibit oscillatory behavior near
the schooling levels associated with advanced degrees, such as MBA or MD.
Low-degree polynomials may not be able to capture this behavior very well, resulting in
large approximation errors $r_i$.

Therefore, the question is: With the same number of parameters, can we find a
much better approximation? In other words, can we find some higher-order terms in
the expansion (5) which will provide a higher-quality approximation? More specifically,
can we construct an approximation

$$f(z_i) := E[y_i \mid z_i] = \beta_{k_1} P_{k_1}(z_i) + \cdots + \beta_{k_s} P_{k_s}(z_i) + r_i, \tag{7}$$

for some regressor indices $k_1, \ldots, k_s$ selected from $\{1, \ldots, p\}$, that is accurate and
much better than (6), in the sense of having a much smaller approximation error $r_i$?

Obviously the answer to the latter question depends on how complex the behavior
of the true regression function (5) is. If the behavior is not complex, then a
low-dimensional approximation should be accurate. Moreover, it is clear that the
second approximation (7) is weakly better than the first (6), and can be much better
if there are some important high-order terms in (5) that are completely missed by
the first approximation. Indeed, in the context of the earning function example, such
important high-order terms could capture abrupt positive changes in earnings associated
with advanced degrees such as MBA or MD. Thus, the answer to the question
depends strongly on the empirical context.

Consider for example the earnings of prime age white males in the 2000 U.S.
Census (see e.g., Angrist, Chernozhukov and Fernandez-Val [2]). Treating these data
as the population data, we can then compute $f(z_i) = E[y_i \mid z_i]$ without error.
Figure 1 plots this function. (Of course, such a strategy is not generally available in
empirical work, since the population data are generally not available.) We then
construct two sparse approximations and also plot them in Figure 1: the first is the
conventional one, of the form (6), with $P_1, \ldots, P_s$ representing an $(s-1)$-degree
polynomial, and the second is an approximation of the form (7), with $P_{k_1}, \ldots, P_{k_s}$
consisting of a constant, a linear term, and two linear spline terms with knots
located at 16 and 19 years of schooling (in the case of $s = 5$ a third knot is located
at 17). In fact, we find the latter approximation automatically using $\ell_1$-penalization
methods, although in this special case we could construct such an approximation
just by eye-balling Figure 1 and noting that most of the function is described by
a linear function, with a few abrupt changes that can be captured by linear spline
terms that induce large changes in slope near 17 and 19 years of schooling. Note
that an exhaustive search for a low-dimensional approximation requires looking at
a very large set of models. We avoided this exhaustive search by using $\ell_1$-penalized
least squares (LASSO), which penalizes the size of the model through the sum of
absolute values of regression coefficients. Table 1 quantifies the performance of the
different sparse approximations. (Of course, a simple strategy of eye-balling also
works in this simple illustrative setting, but clearly does not apply to more general
examples with several conditioning variables $z_i$, for example, when we want to
condition on education, experience, and age.) □
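As a rough illustration of the question posed in Example 1, the following Python sketch compares a cubic polynomial ($s = 4$) with a four-term approximation made of a constant, a linear term, and two linear splines. Everything specific here is an assumption made for illustration: the function `f`, its slopes, the schooling grid `z`, and the candidate knots are synthetic stand-ins, not the Census data, and the two knots are found by a small exhaustive search rather than by LASSO, which is feasible only because the dictionary is tiny.

```python
import itertools
import numpy as np

# Hypothetical "population" wage function on a grid of schooling years:
# linear, with slope changes at 16 and 19 years (assumed, not Census data).
z = np.arange(8, 21, dtype=float)
f = 1.5 + 0.08 * z + 0.15 * np.maximum(z - 16, 0) - 0.10 * np.maximum(z - 19, 0)

def l2_error(X):
    """Root mean squared error of the least squares projection of f on X."""
    coef, *_ = np.linalg.lstsq(X, f, rcond=None)
    return float(np.sqrt(np.mean((f - X @ coef) ** 2)))

# Conventional approximation with s = 4 terms: a cubic polynomial
# (schooling is standardized only to improve numerical conditioning).
zc = (z - z.mean()) / z.std()
X_poly = np.column_stack([zc ** k for k in range(4)])
err_poly = l2_error(X_poly)

def spline_design(knots):
    """Constant, linear term, and one linear spline (z - k)_+ per knot."""
    cols = [np.ones_like(z), z] + [np.maximum(z - k, 0) for k in knots]
    return np.column_stack(cols)

# Sparse approximation with s = 4 terms: pick the best pair of knots
# from a small candidate set by exhaustive search.
candidates = range(9, 20)
best = min(itertools.combinations(candidates, 2),
           key=lambda ks: l2_error(spline_design(ks)))
err_sparse = l2_error(spline_design(best))

print("selected knots:", best)
print("cubic polynomial error:", err_poly)
print("sparse spline error:", err_sparse)
```

With the same number of terms, the spline approximation reproduces the synthetic function essentially exactly, while the cubic polynomial cannot follow the kinks. With several conditioning variables the exhaustive search used here quickly becomes infeasible, which is the motivation for $\ell_1$-penalization given in the text.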



Sparse Approximation    s    $L_2$ error    $L_\infty$ error
Conventional            4    0.1212         0.2969
Conventional            5    0.1210         0.2896
LASSO                   4    0.0865         0.1443
LASSO                   5    0.0752         0.1154
Post-LASSO              4    0.0586         0.1334
Post-LASSO              5    0.0397         0.0788



Table 1 Errors of Conventional and the LASSO-based Sparse Approximations of the Earning
Function. The LASSO estimator minimizes the least squares criterion plus the $\ell_1$-norm of the
coefficients scaled by a penalty parameter $\lambda$. As shown later, it turns out to have only a few
non-zero components. The Post-LASSO estimator minimizes the least squares criterion over the
non-zero components selected by the LASSO estimator.



The next two applications are natural examples with large sets of regressors 
among which we need to select some smaller sets to be used in further estimation 
and inference. These examples illustrate the potential wide applicability of HDS 
modeling in econometrics, since many classical and new data sets have naturally 
multi-dimensional regressors. For example, the American Housing Survey records 
prices and multi-dimensional features of houses sold, and scanner data-sets record 
prices and multi-dimensional information on products sold at a store or on the inter- 
net. 

Example 2: Instrument Selection in Angrist and Krueger Data. The second
example we consider is an instrumental variables model, as in Angrist and Krueger

$$y_{i1} = \theta_0 + \theta_1 y_{i2} + w_i'\gamma + v_i, \quad E[v_i \mid w_i, x_i] = 0,$$
$$y_{i2} = x_i'\beta + w_i'\delta + \epsilon_i, \quad E[\epsilon_i \mid w_i, x_i] = 0,$$

where, for person $i$, $y_{i1}$ denotes wage, $y_{i2}$ denotes education, $w_i$ denotes a vector
of control variables, and $x_i$ denotes a vector of instrumental variables that affect
education but do not directly affect the wage. The instruments $x_i$ come from the
quarter-of-birth dummies, and from a very large list, total of 180, formed by interacting
quarter-of-birth dummies with control variables $w_i$. The interest focuses on
measuring the coefficient $\theta_1$, which summarizes the causal impact of education on
earnings, via instrumental variable estimators.

There are two basic options used in the literature: one uses just the quarter-of-birth
dummies, that is, the leading 3 instruments, and another uses all 183 instruments.
It is well known that using just 3 instruments results in estimates of the
schooling coefficient $\theta_1$ that have a large variance and small bias, while using 183
instruments results in estimates that have a much smaller variance but (potentially)
large bias, see, e.g., [14]. It turns out that, under some conditions, by using $\ell_1$-based
estimation of the first stage, we can construct estimators that also have a nearly
efficient variance and at the same time small bias. Indeed, as shown in Table 2, using
the LASSO estimator induced by different penalty levels defined in Section 2, it is
possible to find just 37 instruments that contain nearly all information in the first
stage equation. Limiting the number of the instruments from 183 to just 37 reduces
the bias of the final instrumental variable estimator. For a further analysis of IV
estimates based on LASSO-selected instruments, we refer the reader to [6].

[Figure 1: two panels, each titled "Traditional vs LASSO approximations" (the top panel
"with 4 coefficients"), plotting the Expected Wage Function, the Post-LASSO Approximation,
and the Traditional Approximation against Education.]

Fig. 1 The figures illustrate the Post-LASSO sparse approximation and the traditional (low degree
polynomial) approximation of the wage function. The top figure uses s = 4 and the bottom figure
uses s = 5.
Table 2 Instrumental Variable Estimates of Return to Schooling in Angrist and Krueger Data

Instruments        Return to Schooling    Robust Std Error
3                  0.1077                 0.0201
180                0.0928                 0.0144
LASSO-selected
5                  0.1062                 0.0179
7                  0.1034                 0.0175
17                 0.0946                 0.0160
37                 0.0963                 0.0143

□

Example 3: Cross-country Growth Regression. One of the central issues in
the empirical growth literature is estimating the effect of an initial (lagged) level of
GDP (Gross Domestic Product) per capita on the growth rates of GDP per capita. In
particular, a key prediction from the classical Solow-Swan-Ramsey growth model
is the hypothesis of convergence, which states that poorer countries should typically
grow faster and therefore should tend to catch up with the richer countries. Such
a hypothesis implies that the effect of the initial level of GDP on the growth rate
should be negative. As pointed out in Barro and Sala-i-Martin [5], this hypothesis
is rejected using a simple bivariate regression of growth rates on the initial level of
GDP. (In this data set, linear regression yields an insignificant positive coefficient
of 0.0013.) In order to reconcile the data and the theory, the literature has focused
on estimating the effect conditional on the pertinent characteristics of countries.
Covariates that describe such characteristics can include variables measuring education
and science policies, strength of market institutions, trade openness, savings rates
and others [5]. The theory then predicts that for countries with similar other
characteristics the effect of the initial level of GDP on the growth rate should be negative
([5]). Thus, we are interested in a specification of the form:

$$y_i = \alpha_0 + \alpha_1 \log G_i + \sum_{j=1}^{p} \beta_j X_{ij} + \epsilon_i, \tag{8}$$

where $y_i$ is the growth rate of GDP over a specified decade in country $i$, $G_i$ is the
initial level of GDP at the beginning of the specified period, and the $X_{ij}$'s form a
long list of country $i$'s characteristics at the beginning of the specified period. We
are interested in testing the hypothesis of convergence, namely that $\alpha_1 < 0$.

Given that in standard data-sets, such as Barro and Lee data [4], the number
of covariates $p$ we can condition on is large, at least relative to the sample size $n$,
covariate selection becomes a crucial issue in this analysis ([16], [22]). In particular,
previous findings came under severe criticism for relying on ad hoc procedures for
covariate selection. In fact, in some cases, all of the previous findings have been
questioned ([16]). Since the number of covariates is high, there is no simple way to
resolve the model selection problem using only classical tools. Indeed the number of
possible lower-dimensional models is very large, although [16] and [22] attempt to
search over several millions of these models. We suggest $\ell_1$-penalization and
post-$\ell_1$-penalization methods to address this important issue. In Section 8, using these
methods we estimate the growth model (8) and indeed find rather strong support for
the hypothesis of convergence, thus confirming the basic implication of the
Solow-Swan model. □

Notation. In what follows, all parameter values are indexed by the sample size $n$,
but we omit the index whenever this does not cause confusion. In making asymptotic
statements, we assume that $n \to \infty$ and $p = p_n \to \infty$, and we also allow for
$s = s_n \to \infty$. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$ and
$a \wedge b = \min\{a, b\}$. The $\ell_2$-norm is denoted by $\|\cdot\|$ and the "$\ell_0$-norm" $\|\cdot\|_0$
denotes the number of non-zero components of a vector. Given a vector $\delta \in \mathbb{R}^p$
and a set of indices $T \subset \{1, \ldots, p\}$, we denote by $\delta_T$ the vector in which
$\delta_{Tj} = \delta_j$ if $j \in T$, $\delta_{Tj} = 0$ if $j \notin T$.
We also use standard notation in the empirical process literature,

$$\mathbb{E}_n[f] = \mathbb{E}_n[f(w_i)] = \sum_{i=1}^{n} f(w_i)/n,$$

and we use the notation $a \lesssim b$ to denote $a \leq cb$ for some constant $c > 0$ that does not
depend on $n$; and $a \lesssim_P b$ to denote $a = O_P(b)$. Moreover, for two random variables
$X, Y$ we say that $X =_d Y$ if they have the same probability distribution. We also
define the prediction norm associated with the empirical Gram matrix $\mathbb{E}_n[x_i x_i']$ as

$$\|\delta\|_{2,n} = \sqrt{\mathbb{E}_n[(x_i'\delta)^2]}.$$
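The notation above translates directly into a few array operations. The following Python sketch is only illustrative: the helper names `l0_norm`, `restrict` and `prediction_norm` are hypothetical, the small design matrix is made up, and indices are 0-based as is usual in Python (the text uses $1, \ldots, p$).

```python
import numpy as np

def l0_norm(delta):
    """The "l0-norm": number of non-zero components of delta."""
    return int(np.count_nonzero(delta))

def restrict(delta, T):
    """delta_T: keeps delta_j for j in T and sets the other components to 0."""
    out = np.zeros_like(delta)
    idx = list(T)
    out[idx] = delta[idx]
    return out

def prediction_norm(X, delta):
    """||delta||_{2,n} = sqrt(E_n[(x_i' delta)^2]), with rows of X equal to x_i'."""
    return float(np.sqrt(np.mean((X @ delta) ** 2)))

# A made-up design with n = 4 observations and p = 3 regressors.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
delta = np.array([1.0, 0.0, -1.0])

gram = X.T @ X / X.shape[0]          # empirical Gram matrix E_n[x_i x_i']
print(l0_norm(delta))                # number of non-zero components
print(restrict(delta, {0}))          # delta restricted to T = {0}
print(prediction_norm(X, delta))     # equals sqrt(delta' gram delta)
```

The last line reflects the identity $\|\delta\|_{2,n}^2 = \delta' \mathbb{E}_n[x_i x_i'] \delta$, which is why the prediction norm is said to be associated with the empirical Gram matrix.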



2 The Setting and Estimators 
2.1 The Model 

Throughout the rest of the chapter we consider the nonparametric model introduced
in the previous section:

$$y_i = f(z_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n, \tag{9}$$

where $y_i$ is the outcome, $z_i$ is a vector of fixed regressors, and $\epsilon_i$'s are i.i.d.
disturbances. Define $x_i = P(z_i)$, where $P(z_i)$ is a $p$-vector of transformations of $z_i$,
including a constant, and $f_i = f(z_i)$. For a conformable sparse vector $\beta_0$ to be defined
below, we can rewrite (9) in an approximately parametric form:

$$y_i = x_i'\beta_0 + u_i, \quad u_i = r_i + \epsilon_i, \quad i = 1, \ldots, n, \tag{10}$$

where $r_i := f_i - x_i'\beta_0$, $i = 1, \ldots, n$, are approximation errors. We note that in the
parametric case, we may naturally choose $x_i'\beta_0 = f_i$ so that $r_i = 0$ for all $i = 1, \ldots, n$.
In the nonparametric case, we shall choose $x_i'\beta_0$ as a sparse parametric model that
yields a good approximation to the true regression function $f_i$ in equation (9).
Given (10), our target in estimation will become the parametric function $x_i'\beta_0$.
Here we emphasize that the ultimate target in estimation is, of course, $f_i$, while
$x_i'\beta_0$ is a convenient intermediate target, introduced so that we can approach the
estimation problem as if it were parametric. Indeed, the two targets are equal up
to approximation errors $r_i$'s that will be set smaller than estimation errors. Thus,
the problem of estimating the parametric target $x_i'\beta_0$ is equivalent to the problem of
estimating the non-parametric target $f_i$ modulo approximation errors.

With that in mind, we choose our target or "true" $\beta_0$, with the corresponding
cardinality of its support

$$s = \|\beta_0\|_0,$$

as any solution to the following ideal risk minimization or oracle problem:

$$\min_{\beta \in \mathbb{R}^p} \mathbb{E}_n[(f_i - x_i'\beta)^2] + \sigma^2 \frac{\|\beta\|_0}{n}. \tag{11}$$

We call this problem the oracle problem for the reasons explained below, and we
call

$$T = \mathrm{support}(\beta_0)$$

the oracle or the "true" model. Note that we necessarily have that $s \leq n$.

The oracle problem (11) balances the approximation error $\mathbb{E}_n[(f_i - x_i'\beta)^2]$ over
the design points with the variance term $\sigma^2 \|\beta\|_0 / n$, where the latter is determined
by the number of non-zero coefficients in $\beta$. Letting

$$c_s^2 := \mathbb{E}_n[r_i^2] = \mathbb{E}_n[(f_i - x_i'\beta_0)^2]$$

denote the average square error from approximating values $f_i$ by $x_i'\beta_0$, the quantity
$c_s^2 + \sigma^2 s/n$ is the optimal value of (11). Typically, the optimality in (11) would
balance the approximation error with the variance term so that for some absolute
constant $K \geq 0$

$$c_s \leq K \sigma \sqrt{s/n}, \tag{12}$$

so that $\sqrt{c_s^2 + \sigma^2 s/n} \lesssim \sigma \sqrt{s/n}$. Thus, the quantity $\sigma \sqrt{s/n}$ becomes the ideal goal
for the rate of convergence. If we knew the oracle model $T$, we would achieve this
rate by using the oracle estimator, the least squares estimator based on this model,
but we in general do not know $T$, since we do not observe the $f_i$'s to attempt to
solve the oracle problem (11). Since $T$ is unknown, we will not be able to achieve
the exact oracle rates of convergence, but we can hope to come close to this rate.
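To make the oracle problem (11) concrete, the following Python sketch solves it by brute force over all supports for a very small $p$. This is an infeasible benchmark by construction, since it uses the unobserved values $f_i$ and the noise level $\sigma$. The function name `oracle_model`, the simulated design, and the coefficients used to build `f` are assumptions for illustration only.

```python
import itertools
import numpy as np

def oracle_model(X, f, sigma):
    """Brute-force oracle: minimize E_n[(f_i - x_i'beta)^2] + sigma^2 * k / n
    over all supports of size k, using the (normally unobserved) values f_i."""
    n, p = X.shape
    best_risk, best_T = float(np.mean(f ** 2)), ()   # empty model, beta = 0
    for k in range(1, min(n, p) + 1):
        for T in itertools.combinations(range(p), k):
            XT = X[:, list(T)]
            coef, *_ = np.linalg.lstsq(XT, f, rcond=None)
            risk = float(np.mean((f - XT @ coef) ** 2) + sigma ** 2 * k / n)
            if risk < best_risk:
                best_risk, best_T = risk, T
    return best_risk, best_T

# Hypothetical regression function: two large coefficients and one tiny one.
rng = np.random.default_rng(0)
n, p, sigma = 50, 6, 1.0
X = rng.standard_normal((n, p))
f = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.01 * X[:, 2]

risk, T = oracle_model(X, f, sigma)
print("oracle model:", T)
print("oracle risk c_s^2 + sigma^2 s / n:", risk)
```

In this example the oracle drops the regressor with the tiny coefficient: including it would lower the approximation error by roughly $10^{-4}$ but raise the variance term by $\sigma^2/n = 0.02$. The resulting $\beta_0$ is sparse with $s = 2$ and a small but non-zero approximation error $c_s$.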

We consider the case of fixed design, namely we treat the covariate values
$x_1, \ldots, x_n$ as fixed. This includes random sampling as a special case; indeed, in this
case $x_1, \ldots, x_n$ represent a realization of this sample on which we condition throughout.
Without loss of generality, we normalize the covariates so that

$$\hat\sigma_j^2 := \mathbb{E}_n[x_{ij}^2] = 1 \quad \text{for } j = 1, \ldots, p. \tag{13}$$

We summarize the setup as the following condition. 
Condition ASM. We have data $\{(y_i, z_i), i = 1, \ldots, n\}$ that for each $n$ obey the
regression model (9), which admits the approximately sparse form (10) induced by
(11) with the approximation error satisfying (12). The regressors $x_i = P(z_i)$ are
normalized as in (13).



Remark 1 (On the Oracle Problem). Let us now briefly explain what is behind
problem (11). Under some mild assumptions, this problem directly arises as the
(infeasible) oracle risk minimization problem. Indeed, consider an OLS estimator $\hat\beta[\tilde T]$,
which is obtained by using a model $\tilde T$, i.e. by regressing $y_i$ on regressors $x_i[\tilde T]$, where
$x_i[\tilde T] = \{x_{ij}, j \in \tilde T\}$. This estimator takes value
$\hat\beta[\tilde T] = \mathbb{E}_n[x_i[\tilde T] x_i[\tilde T]']^{-} \mathbb{E}_n[x_i[\tilde T] y_i]$.
The expected risk of this estimator $\mathbb{E}_n E[f_i - x_i[\tilde T]'\hat\beta[\tilde T]]^2$ is equal to

$$\min_{\beta \in \mathbb{R}^{|\tilde T|}} \mathbb{E}_n[(f_i - x_i[\tilde T]'\beta)^2] + \sigma^2 \frac{k}{n},$$

where $k = \mathrm{rank}(\mathbb{E}_n[x_i[\tilde T] x_i[\tilde T]'])$. The oracle knows the risk of each of the models $\tilde T$
and can minimize this risk

$$\min_{\tilde T} \min_{\beta \in \mathbb{R}^{|\tilde T|}} \mathbb{E}_n[(f_i - x_i[\tilde T]'\beta)^2] + \sigma^2 \frac{k}{n},$$

by choosing the best model or the oracle model $T$. This problem is in fact equivalent
to (11), provided that $\mathrm{rank}(\mathbb{E}_n[x_i[T] x_i[T]']) = \|\beta_0\|_0$, i.e. full rank. Thus, in this case
the value $\beta_0$ solving (11) is the expected value of the oracle least squares estimator
$\hat\beta_T = \mathbb{E}_n[x_i[T] x_i[T]']^{-1} \mathbb{E}_n[x_i[T] y_i]$, i.e.
$\beta_{0T} = \mathbb{E}_n[x_i[T] x_i[T]']^{-1} \mathbb{E}_n[x_i[T] f_i]$. This value
is our target or "true" parameter value and the oracle model $T$ is the target or "true"
model. Note that when $c_s = 0$ we have that $f_i = x_i'\beta_0$, which gives us the special
parametric case.



2.2 LASSO and Post-LASSO Estimators 

Having introduced the model (10) with the target parameter defined via (11), our
task becomes to estimate $\beta_0$. We will focus on deriving rate of convergence results
in the prediction norm, which measures the accuracy of predicting $x_i'\beta_0$ over the
design points $x_1, \ldots, x_n$,

$$\|\delta\|_{2,n} = \sqrt{\mathbb{E}_n[(x_i'\delta)^2]}.$$

In what follows $\delta$ will denote deviations of the estimators from the true parameter
value. Thus, e.g., for $\delta = \hat\beta - \beta_0$, the quantity $\|\delta\|_{2,n}^2$ denotes the average of the
square errors $x_i'\hat\beta - x_i'\beta_0$ resulting from using the estimate $x_i'\hat\beta$ instead of $x_i'\beta_0$. Note
that once we bound $\hat\beta - \beta_0$ in the prediction norm, we can also bound the empirical
risk of predicting values $f_i$ by $x_i'\hat\beta$ via the triangle inequality:

$$\sqrt{\mathbb{E}_n[(x_i'\hat\beta - f_i)^2]} \leq \|\hat\beta - \beta_0\|_{2,n} + c_s. \tag{14}$$

In order to discuss estimation, consider first the classical ideal AIC/BIC type
estimator ([1, 23]) that solves the empirical (feasible) analog of the oracle problem:

$$\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_0,$$

where $\hat Q(\beta) = \mathbb{E}_n[(y_i - x_i'\beta)^2]$ and $\|\beta\|_0 = \sum_{j=1}^{p} 1\{|\beta_j| > 0\}$ is the $\ell_0$-norm and
$\lambda$ is the penalty level. This estimator has very attractive theoretical properties, but
unfortunately it is computationally prohibitive, since the solution to the problem
may require solving $\sum_{k \leq n} \binom{p}{k}$ least squares problems (generically, the complexity of
this problem is NP-hard [19, 12]).
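The size of the search behind the $\ell_0$-penalized problem can be evaluated directly. The short Python sketch below counts the $\sum_{k \leq n} \binom{p}{k}$ candidate least squares problems; the function name `num_ls_problems` and the values of $p$ and $n$ in the example are illustrative assumptions, not figures from the text.

```python
from math import comb

def num_ls_problems(p, n):
    """Number of candidate supports of size at most n among p regressors,
    i.e. sum_{k <= n} C(p, k), counting the empty model (k = 0)."""
    return sum(comb(p, k) for k in range(0, min(n, p) + 1))

print(num_ls_problems(10, 3))     # small case that can still be enumerated
print(num_ls_problems(100, 50))   # already far beyond exhaustive search
```

Even for a moderate dictionary of $p = 100$ regressors and $n = 50$ observations the count exceeds $10^{29}$, which is why a convex relaxation is needed in practice.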

One way to overcome the computational difficulty is to consider a convex relaxation
of the preceding problem, namely to employ the closest convex penalty,
the $\ell_1$ penalty, in place of the $\ell_0$ penalty. This construction leads to the so-called
LASSO estimator$^1$

$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_1, \tag{15}$$

where as before $\hat Q(\beta) = \mathbb{E}_n[(y_i - x_i'\beta)^2]$ and $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$. The LASSO
estimator minimizes a convex function. Therefore, from a computational complexity
perspective, (15) is a computationally efficient (i.e. solvable in polynomial time)
alternative to the AIC/BIC estimator.
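Because (15) is convex, simple iterative schemes can solve it. The following Python sketch uses cyclic coordinate descent with soft-thresholding, which is one standard algorithm for this type of problem and is chosen here only for illustration; it is not claimed to be the method used by the authors. The function names `soft_threshold` and `lasso_cd`, the iteration limits, and the toy design are assumptions. Note the scaling: with $\hat Q(\beta) = \mathbb{E}_n[(y_i - x_i'\beta)^2]$ and penalty $(\lambda/n)\|\beta\|_1$, the coordinate update thresholds at $\lambda/2$.

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for min_beta E_n[(y_i - x_i'beta)^2] + (lam/n) ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float)              # current residual y - X @ beta (a copy)
    col_ss = (X ** 2).sum(axis=0)        # sum_i x_ij^2 for each column j
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of column j with the partial residual excluding j.
            rho = X[:, j] @ resid + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_ss[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return beta

# Toy orthogonal design (X'X = 4 I), where the solution has a closed form.
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
y = np.array([3.0, 1.0, 3.0, 1.0])       # y = 2 * x_1 + 1 * x_2 exactly
print(lasso_cd(X, y, lam=0.0))           # no penalty: least squares
print(lasso_cd(X, y, lam=10.0))          # shrinks beta_1 and sets beta_2 to zero
```

In the penalized fit the second coefficient is set exactly to zero while the first is shrunk from 2 to 0.75, which illustrates both the model selection and the shrinkage bias of LASSO.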

In order to describe the choice of $\lambda$, we highlight the following key quantity
determining this choice:

$$S = 2\mathbb{E}_n[x_i \epsilon_i],$$

which summarizes the noise in the problem. We would like to choose the smallest
penalty level so that

$$\lambda \geq c n \|S\|_\infty \quad \text{with probability at least } 1 - \alpha, \tag{16}$$

where $1 - \alpha$ needs to be close to one, and $c$ is a constant such that $c > 1$. Following
and [8], respectively, we consider two choices of $\lambda$ that achieve the above:

$$X\text{-independent penalty:} \quad \lambda := 2c\sigma\sqrt{n}\,\Phi^{-1}(1 - \alpha/2p), \tag{17}$$
$$X\text{-dependent penalty:} \quad \lambda := 2c\sigma\Lambda(1 - \alpha \mid X), \tag{18}$$

where $\alpha \in (0, 1)$ and $c > 1$ is constant, and



1 The abbreviation LASSO stands for Least Absolute Shrinkage and Selection Operator, c.f. [24].
$$\Lambda(1 - \alpha \mid X) := (1 - \alpha)\text{-quantile of } n\|S/(2\sigma)\|_\infty,$$

conditional on $X = (x_1, \ldots, x_n)'$. Note that

$$\|S/(2\sigma)\|_\infty =_d \max_{1 \leq j \leq p} |\mathbb{E}_n[x_{ij} g_i]|, \quad \text{where } g_i\text{'s are i.i.d. } N(0, 1),$$

conditional on $X$, so we can compute $\Lambda(1 - \alpha \mid X)$ simply by simulating the latter
quantity, given the fixed design matrix $X$. Regarding the choice of $\alpha$ and $c$,
asymptotically we require $\alpha \to 0$ as $n \to \infty$ and $c > 1$. Non-asymptotically, in our
finite-sample experiments, $\alpha = .1$ and $c = 1.1$ work quite well. The noise level $\sigma$
is unknown in practice, but we can estimate it consistently using the approach of
Section 6. We recommend the X-dependent rule over the X-independent rule, since
the former by construction adapts to the design matrix $X$ and is less conservative
than the latter in view of the following relationship that follows from Lemma 8:

$$\Lambda(1 - \alpha \mid X) \leq \sqrt{n}\,\Phi^{-1}(1 - \alpha/2p) \leq \sqrt{2n\log(2p/\alpha)}. \tag{19}$$
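The two penalty rules (17) and (18) can be computed as described: the X-independent rule from the normal quantile, and the X-dependent rule by simulating $n \max_j |\mathbb{E}_n[x_{ij} g_i]|$ with i.i.d. standard normal $g_i$. The Python sketch below follows this recipe with $\alpha = .1$ and $c = 1.1$ as in the text. The function names, the number of simulation draws, the seeds, and the correlated design are assumptions for illustration, and $\sigma$ is treated as known.

```python
import numpy as np
from statistics import NormalDist

def x_independent_penalty(n, p, sigma, alpha=0.1, c=1.1):
    """Penalty (17): 2 c sigma sqrt(n) Phi^{-1}(1 - alpha / (2p))."""
    return 2 * c * sigma * np.sqrt(n) * NormalDist().inv_cdf(1 - alpha / (2 * p))

def x_dependent_penalty(X, sigma, alpha=0.1, c=1.1, n_sim=2000, seed=0):
    """Penalty (18): 2 c sigma Lambda(1 - alpha | X), with Lambda simulated as the
    (1 - alpha)-quantile of n * max_j |E_n[x_ij g_i]| = max_j |sum_i x_ij g_i|."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_sim, n))          # one row of g_i's per draw
    stats = np.abs(g @ X).max(axis=1)            # max over j for each draw
    return 2 * c * sigma * np.quantile(stats, 1 - alpha)

# Hypothetical design with strongly correlated regressors, normalized as in (13).
rng = np.random.default_rng(1)
n, p, sigma = 100, 50, 1.0
common = rng.standard_normal((n, 1))
X = np.sqrt(0.9) * common + np.sqrt(0.1) * rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))              # E_n[x_ij^2] = 1 for every j

lam_indep = x_independent_penalty(n, p, sigma)
lam_dep = x_dependent_penalty(X, sigma)
print("X-independent penalty:", lam_indep)
print("X-dependent penalty:  ", lam_dep)
```

With highly correlated regressors the maximum over $j$ behaves like a maximum over far fewer than $p$ independent terms, so the simulated X-dependent penalty is noticeably smaller than the X-independent one, in line with (19).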

Regularization by the $\ell_1$-norm employed in (15) naturally helps the LASSO
estimator to avoid overfitting the data, but it also shrinks the fitted coefficients towards
zero, causing a potentially significant bias. In order to remove some of this bias, let
us consider the Post-LASSO estimator that applies ordinary least squares regression
to the model $\hat T$ selected by LASSO. Formally, set

$$\hat T = \mathrm{support}(\hat\beta) = \{j \in \{1, \ldots, p\} : |\hat\beta_j| > 0\},$$

and define the Post-LASSO estimator $\tilde\beta$ as

$$\tilde\beta \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) : \beta_j = 0 \text{ for each } j \in \hat T^c, \tag{20}$$

where $\hat T^c = \{1, \ldots, p\} \setminus \hat T$. In words, the estimator is ordinary least squares applied
to the data after removing the regressors that were not selected by LASSO. If the
model selection works perfectly, that is, $\hat T = T$, then the Post-LASSO estimator
is simply the oracle estimator whose properties are well known. However, perfect
model selection might be unlikely for many designs of interest, so we are especially
interested in the properties of Post-LASSO in such cases, namely when $\hat T \neq T$,
especially when $T \not\subseteq \hat T$.
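The Post-LASSO step (20) is an ordinary least squares refit on the selected support. The Python sketch below assumes a LASSO estimate is already available and only performs the refit; the function name `post_lasso`, the tolerance argument, and the toy orthogonal design are illustrative assumptions. The LASSO coefficients in the example are entered by hand: for this orthogonal design and $\lambda = 10$ the solution of (15) is available in closed form by soft-thresholding.

```python
import numpy as np

def post_lasso(X, y, beta_lasso, tol=0.0):
    """OLS on the support selected by LASSO; coefficients outside it stay at zero."""
    beta_lasso = np.asarray(beta_lasso, dtype=float)
    support = np.flatnonzero(np.abs(beta_lasso) > tol)   # selected model
    beta_post = np.zeros_like(beta_lasso)
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_post[support] = coef
    return beta_post, support

# Toy orthogonal design (X'X = 4 I) with y = 2 * x_1 + 1 * x_2 exactly.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 1.0],
              [1.0, 1.0, -1.0],
              [1.0, -1.0, -1.0]])
y = np.array([3.0, 1.0, 3.0, 1.0])

# LASSO solution of (15) for lam = 10 in this design (closed form, by hand).
beta_lasso = np.array([0.75, 0.0, 0.0])

beta_post, support = post_lasso(X, y, beta_lasso)
print("selected support:", support)
print("Post-LASSO coefficients:", beta_post)
```

Here LASSO shrinks the first coefficient from 2 to 0.75 and the refit restores it to 2. The refit cannot recover the second regressor, which LASSO dropped, so this is also a small instance of the case $T \not\subseteq \hat T$ mentioned above.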



2.3 Intuition and Geometry of LASSO and Post-LASSO 

In this section we discuss the intuition behind LASSO and Post-LASSO estimators
defined above. We shall rely on a dual interpretation of the LASSO optimization
problem to provide some geometrical intuition for the performance of LASSO.
Indeed, it can be seen that the LASSO estimator also solves the following optimization
program:

$$\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 : \hat Q(\beta) \leq \gamma \tag{21}$$

for some value of $\gamma \geq 0$ (that depends on the penalty level $\lambda$). Thus, the estimator
minimizes the $\ell_1$-norm of coefficients subject to maintaining a certain goodness-of-fit;
or, geometrically, the LASSO estimator searches for a minimal $\ell_1$-ball (the
diamond) subject to the diamond having a non-empty intersection with a fixed lower
contour set of the least squares criterion function (the ellipse).

In Figure 2 we show an illustration for the two-dimensional case with the true
parameter value $(\beta_{01}, \beta_{02})$ equal to $(1, 0)$, so that $T = \mathrm{support}(\beta_0) = \{1\}$ and $s = 1$. In
the figure we plot the diamonds and ellipses. In the top figure, the ellipse represents
a lower contour set of the population criterion function $Q(\beta) = E[(y_i - x_i'\beta)^2]$ in
the zero noise case or the infinite sample case. In the bottom figures the ellipse
represents a contour set of the sample criterion function $\hat Q(\beta) = \mathbb{E}_n[(y_i - x_i'\beta)^2]$
in the non-zero noise or the finite sample case. The set of optimal solutions $\hat\beta$ for
LASSO is then given by the intersection of the minimal diamonds with the ellipses.
Finally, recall that Post-LASSO is computed as the ordinary least squares solution
using covariates selected by LASSO. Thus, the Post-LASSO estimate $\tilde\beta$ is given by the
center of the ellipse intersected with the linear subspace selected by LASSO.

In the zero-noise case or in population (top figure), LASSO easily recovers the
correct sparsity pattern of $\beta_0$. Note that due to the regularization, in spite of the
absence of noise, the LASSO estimator has a large bias towards zero. However, in
this case Post-LASSO $\tilde\beta$ removes the bias and recovers $\beta_0$ perfectly.

In the non-zero noise case (middle and bottom figures), the contours of the
criterion function and its center move away from the population counterpart. The
empirical error in the middle figure moves the center of the ellipse to a non-sparse point.
However, LASSO correctly sets $\hat\beta_2 = 0$ and $\hat\beta_1 \neq 0$, recovering the sparsity pattern of
$\beta_0$. Using the selected support, Post-LASSO $\tilde\beta$ becomes the oracle estimator, which
drastically improves upon LASSO. In the case of the bottom figure, we have large
empirical errors that push the center of the lower contour set further away from
the population counterpart. These large empirical errors make the LASSO estimator
non-sparse, incorrectly setting $\hat\beta_2 \neq 0$. Therefore, Post-LASSO uses $\hat T = \{1, 2\}$
and does not use the exact support $T = \{1\}$. Thus, Post-LASSO is not the oracle
estimator in this case.

All three figures also illustrate the shrinkage bias towards zero in the LASSO
estimator that is introduced by the $\ell_1$-norm penalty. The Post-LASSO estimator
is motivated as a solution to remove (or at least alleviate) this shrinkage bias. In
cases where LASSO achieves a good sparsity pattern, Post-LASSO can drastically
improve upon LASSO.
2.4 Primitive conditions



In both the parametric and non-parametric models described above, whenever $p > n$,
the empirical Gram matrix $\mathbb{E}_n[x_i x_i']$ does not have full rank and hence it is not
well-behaved. However, we only need good behavior of certain moduli of continuity of
the Gram matrix called restricted sparse eigenvalues. We define the minimal
restricted sparse eigenvalue

$$\kappa(m)^2 := \min_{\|\delta_{T^c}\|_0 \leq m,\, \delta \neq 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2}, \tag{22}$$

and the maximal restricted sparse eigenvalue as

$$\phi(m) := \max_{\|\delta_{T^c}\|_0 \leq m,\, \delta \neq 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|^2}, \tag{23}$$

where $m$ is the upper bound on the number of non-zero components outside the
support $T$. To assume that $\kappa(m) > 0$ requires that all empirical Gram submatrices
formed by any $m$ components of $x_i$ in addition to the components in $T$ are positive
definite. It will be convenient to define the following sparse condition number
associated with the empirical Gram matrix:

$$\mu(m) = \frac{\sqrt{\phi(m)}}{\kappa(m)}. \tag{24}$$

In order to state simplified asymptotic statements, we shall also invoke the
following condition.

Condition RSE. Sparse eigenvalues of the empirical Gram matrix are well behaved,
in the sense that for $m = m_n = s\log n$

$$\mu(m) \lesssim 1, \quad \phi(m) \lesssim 1, \quad 1/\kappa(m) \lesssim 1. \tag{25}$$

This condition holds with high probability for many designs of interest under
mild conditions on $s$. For example, as shown in Lemma 1, when the covariates are
Gaussians, the conditions in (25) are true with probability converging to one under
the mild assumption that $s\log p = o(n)$. Condition RSE is likely to hold for other
regressors with jointly light-tailed distributions, for instance log-concave distributions.
As shown in Lemma 2, the conditions in (25) also hold for general bounded
regressors under the assumption that $s(\log^4 n)\log(p \vee n) = o(n)$. Arbitrary bounded
regressors often arise in non-parametric models, where regressors $x_i$ are formed
as spline, trigonometric, or polynomial transformations $P(z_i)$ of some elementary
bounded regressors $z_i$.


Lemma 1 (Gaussian design). Suppose $\tilde x_i$, $i = 1, \ldots, n$, are i.i.d. zero-mean Gaussian
random vectors, such that the population design matrix $E[\tilde x_i \tilde x_i']$ has ones on
the diagonal, and its $s\log n$-sparse eigenvalues are bounded from above by $\varphi < \infty$
and bounded from below by $\kappa^2 > 0$. Define $x_i$ as a normalized form of $\tilde x_i$, namely
$x_{ij} = \tilde x_{ij} / \sqrt{\mathbb{E}_n[\tilde x_{ij}^2]}$. Then for any $m \leq (s\log(n/e)) \wedge (n/[16\log p])$, with probability
at least $1 - 2\exp(-n/16)$,

$$\phi(m) \leq 8\varphi, \quad \kappa(m) \geq \kappa/(6\sqrt{2}), \quad \text{and} \quad \mu(m) \leq 24\sqrt{\varphi}/\kappa.$$

Lemma 2 (Bounded design). Suppose $\tilde x_i$, $i = 1, \ldots, n$, are i.i.d. vectors, such that
the population design matrix $E[\tilde x_i \tilde x_i']$ has ones on the diagonal, and its $s\log n$-sparse
eigenvalues are bounded from above by $\varphi < \infty$ and bounded from below by
$\kappa^2 > 0$. Define $x_i$ as a normalized form of $\tilde x_i$, namely $x_{ij} = \tilde x_{ij} / (\mathbb{E}_n[\tilde x_{ij}^2])^{1/2}$. Suppose
that $\max_{1 \leq i \leq n} \|\tilde x_i\|_\infty \leq K_n$ a.s., and $K_n^2 s\log^2(n)\log^2(s\log n)\log(p \vee n) = o(n\kappa^4/\varphi)$.
Then, for any $m \geq 0$ such that $m + s \leq s\log n$, we have that as $n \to \infty$

$$\phi(m) \leq 4\varphi, \quad \kappa(m) \geq \kappa/2, \quad \text{and} \quad \mu(m) \leq 4\sqrt{\varphi}/\kappa,$$

with probability approaching 1.

For proofs, see Q; the first lemma builds upon results in l26l and the second 
builds upon results in J2T). 



3 Analysis of LASSO 

In this section we discuss the rate of convergence of LASSO in the prediction norm;
our exposition follows mainly [8].

The key quantity in the analysis is the following quantity called "score":

$$S = S(\beta_0) = 2\mathbb{E}_n[x_i \varepsilon_i].$$

The score is the effective "noise" in the problem. Indeed, defining $\delta := \hat\beta - \beta_0$, note
that by Hölder's inequality

$$\hat Q(\hat\beta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2 = -2\mathbb{E}_n[\varepsilon_i x_i'\delta] - 2\mathbb{E}_n[r_i x_i'\delta] \ge -\|S\|_\infty \|\delta\|_1 - 2c_s\|\delta\|_{2,n}. \tag{26}$$

Intuition suggests that we need to majorize the "noise term" $\|S\|_\infty$ by the penalty
level $\lambda/n$, so that the bound on $\|\delta\|_{2,n}^2$ will follow from a relation between the
prediction norm $\|\cdot\|_{2,n}$ and the penalization norm $\|\cdot\|_1$ on a suitable set. Specifically,
for any $c > 1$, it will follow that if

$$\lambda \ge cn\|S\|_\infty$$

and $\|\delta\|_{2,n} > 2c_s$, the vector $\delta$ will also satisfy

$$\|\delta_{T^c}\|_1 \le \bar c \|\delta_T\|_1, \tag{27}$$

where $\bar c = (c + 1)/(c - 1)$. That is, in this case the error in the regularization norm
outside the true support does not exceed $\bar c$ times the error in the true support. (In the
case $\|\delta\|_{2,n} \le 2c_s$ the inequality (27) may not hold, but the bound $\|\delta\|_{2,n} \le 2c_s$ is
already good enough.)
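
The two objects in this argument, the sup-norm of the score and the cone condition (27), are easy to evaluate numerically in a simulation where the disturbances are known. The following is a minimal sketch; the helper names are hypothetical and not part of the original exposition.

```python
import numpy as np

def score_sup_norm(X, eps):
    """||S||_inf for the score S = 2 E_n[x_i eps_i]."""
    n = X.shape[0]
    return float(np.max(np.abs(2.0 * (X.T @ eps) / n)))

def in_restricted_cone(delta, T, c):
    """Check (27): ||delta_{T^c}||_1 <= c_bar * ||delta_T||_1, c_bar = (c+1)/(c-1)."""
    if c <= 1:
        raise ValueError("c must exceed 1")
    c_bar = (c + 1.0) / (c - 1.0)
    on_support = np.zeros(delta.shape[0], dtype=bool)
    on_support[list(T)] = True
    l1_outside = np.abs(delta[~on_support]).sum()
    l1_inside = np.abs(delta[on_support]).sum()
    return bool(l1_outside <= c_bar * l1_inside)
```

In a Monte Carlo experiment one can compare `n * c * score_sup_norm(X, eps)` with the chosen penalty level to see how often the event $\lambda \ge cn\|S\|_\infty$ occurs.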

Consequently, the analysis of the rate of convergence of LASSO relies on the
so-called restricted eigenvalue $\kappa_{\bar c}$, introduced in [8], which controls the modulus of
continuity between the prediction norm $\|\cdot\|_{2,n}$ and the penalization norm $\|\cdot\|_1$ over
the set of vectors $\delta \in \mathbb{R}^p$ that satisfy (27):

$$\kappa_{\bar c} := \min_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta_T \ne 0} \frac{\sqrt{s}\,\|\delta\|_{2,n}}{\|\delta_T\|_1}, \tag{RE($\bar c$)}$$

where $\kappa_{\bar c}$ can depend on $n$. The constant $\kappa_{\bar c}$ is a crucial technical quantity in our
analysis and we need to bound it away from zero. In the leading cases that condition
RSE holds this will in fact be the case as the sample size grows, namely

$$1/\kappa_{\bar c} \lesssim 1. \tag{28}$$

Indeed, we can bound $\kappa_{\bar c}$ from below by

$$\kappa_{\bar c} \ge \max_{m > 0} \kappa(m)\left(1 - \mu(m)\bar c\sqrt{s/m}\right) \ge \kappa(s \log n)\left(1 - \mu(s \log n)\bar c\sqrt{1/\log n}\right)$$

by Lemma 10 stated and proved in the appendix. Thus, under the condition RSE, as
$n$ grows, $\kappa_{\bar c}$ is bounded away from zero since $\kappa(s \log n)$ is bounded away from zero
and $\phi(s \log n)$ is bounded from above as in (25). Several other primitive assumptions
can be used to bound K € . We refer the reader to for a further detailed discussion 
of lower bounds on Kg. 

We next state a non-asymptotic performance bound for the LASSO estimator.

Theorem 1 (Non-Asymptotic Bound for LASSO). Under condition ASM, the
event $\lambda \ge cn\|S\|_\infty$ implies

$$\|\hat\beta - \beta_0\|_{2,n} \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n\kappa_{\bar c}} + 2c_s, \tag{29}$$

where $c_s = 0$ in the parametric case, and $\bar c = (c + 1)/(c - 1)$. Thus, if $\lambda \ge cn\|S\|_\infty$
with probability at least $1 - \alpha$, as guaranteed by either X-independent or
X-dependent penalty levels (17) and (18), then the bound (29) occurs with probability
at least $1 - \alpha$.

The proof of Theorem 1 is given in the appendix. The theorem also leads to the
following useful asymptotic bounds.



Corollary 1 (Asymptotic Bound for LASSO). Suppose that conditions ASM and
RSE hold. If $\lambda$ is chosen according to either the X-independent or X-dependent rule
specified in (17) and (18) with $\alpha = o(1)$, $\log(1/\alpha) \lesssim \log p$, or more generally so
that

$$\lambda \lesssim_P \sigma\sqrt{n \log p} \quad \text{and} \quad \lambda \ge c'n\|S\|_\infty \ \text{wp} \to 1, \tag{30}$$

for some $c' > 1$, then the following asymptotic bound holds:

$$\|\hat\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s \log p}{n}} + c_s.$$

The non-asymptotic and asymptotic bounds for the empirical risk immediately
follow from the triangle inequality:

$$\sqrt{\mathbb{E}_n[(f_i - x_i'\hat\beta)^2]} \le \|\hat\beta - \beta_0\|_{2,n} + c_s. \tag{31}$$

Thus, the rate of convergence of $x_i'\hat\beta$ to $f_i$ coincides with the rate of convergence
of the oracle estimator $\sqrt{c_s^2 + \sigma^2 s/n}$ up to a logarithmic factor of $p$. Nonetheless,
the performance of LASSO can be considered optimal in the sense that under general
conditions the oracle rate is achievable only up to a logarithmic factor of $p$ (see
Donoho and Johnstone [11] and Rigollet and Tsybakov [20]), apart from very
exceptional, stringent cases, in which it is possible to perform perfect or near-perfect
model selection.



4 Model Selection Properties and Sparsity of LASSO 

The purpose of this section is, first, to provide bounds (sparsity bounds) on the 
dimension of the model selected by LASSO, and, second, to describe some special 
cases where the model selected by LASSO perfectly matches the "true" (oracle) 
model. 



4.1 Sparsity Bounds 

Although perfect model selection can be seen as unlikely in many designs, sparsity 
of the LASSO estimator has been documented in a variety of designs. Here we 
describe the sparsity results obtained in Q . Let us define 

$$\hat m := |\hat T \setminus T| = \|\hat\beta_{T^c}\|_0,$$

which is the number of unnecessary components or regressors selected by LASSO.

Theorem 2 (Non-Asymptotic Sparsity Bound for LASSO). Suppose condition
ASM holds. The event $\lambda \ge cn\|S\|_\infty$ implies that

$$\hat m \le s \cdot \Big[\min_{m \in \mathcal{M}} \phi(m \wedge n)\Big] \cdot L,$$

where $\mathcal{M} = \{m \in \mathbb{N} : m > s\phi(m \wedge n) \cdot 2L\}$ and $L = [2\bar c/\kappa_{\bar c} + 3(\bar c + 1)nc_s/(\lambda\sqrt{s})]^2$.

Under Conditions ASM and RSE, for $n$ sufficiently large we have $1/\kappa_{\bar c} \lesssim 1$,
$c_s \lesssim \sigma\sqrt{s/n}$, and $\phi(s \log n) \lesssim 1$; and under the conditions of Corollary 1, $\lambda \ge c\sigma\sqrt{n}$
with probability approaching one. Therefore, we have that $L \lesssim_P 1$ and

$$s \log n > s\phi(s \log n) \cdot 2L, \quad \text{that is,} \quad s \log n \in \mathcal{M}$$

with probability approaching one as $n$ grows. Therefore, under these conditions we
have

$$\min_{m \in \mathcal{M}} \phi(m \wedge n) \lesssim_P 1.$$

Corollary 2 (Asymptotic Sparsity Bound for LASSO). Under the conditions of
Corollary 1, we have that

$$\hat m \lesssim_P s. \tag{32}$$

Thus, using a penalty level that satisfies (30), LASSO's sparsity is asymptotically
of the same order as the oracle sparsity, namely

$$\hat s := |\hat T| \le s + \hat m \lesssim_P s. \tag{33}$$

We note here that Theorem 2 is particularly helpful in designs in which $\min_{m \in \mathcal{M}} \phi(m) \ll \phi(n)$.
This allows Theorem 2 to sharpen the sparsity bound of the form $\hat s \lesssim_P s\phi(n)$
considered in [8] and [18]. The bound above is comparable to the bounds in
[26] in terms of order of magnitude, but Theorem 2 requires a smaller penalty level
$\lambda$ which also does not depend on the unknown sparse eigenvalues as in [26].



4.2 Perfect Model Selection Results 



The purpose of this section is to describe very special cases where perfect model
selection is possible. Most results in the literature for model selection have been
developed for the parametric case only ([18], [17]). Below we provide some results
for the nonparametric models, which cover the parametric models as a special case.

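Lemma 3 (Cases with Perfect Model Selection by Thresholded LASSO). Suppose
condition ASM holds. (1) If the non-zero coefficients of the oracle model are
well separated from zero, that is

$$\min_{j \in T} |\beta_{0j}| > \zeta + t, \quad \text{for some } t \ge \zeta := \max_{j = 1,\dots,p} |\hat\beta_j - \beta_{0j}|,$$

then the oracle model is a subset of the selected model,

$$T := \mathrm{support}(\beta_0) \subseteq \hat T := \mathrm{support}(\hat\beta).$$

Moreover the oracle model $T$ can be perfectly selected by applying hard-thresholding
of level $t$ to the LASSO estimator $\hat\beta$:

$$T = \{j \in \{1,\dots,p\} : |\hat\beta_j| > t\}.$$

(2) In particular, if $\lambda \ge cn\|S\|_\infty$, then for $\hat m = |\hat T \setminus T| = \|\hat\beta_{T^c}\|_0$ we have

$$\zeta \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n\kappa_{\bar c}\,\kappa(\hat m)} + \frac{2c_s}{\kappa(\hat m)}.$$

(3) In particular, if $\lambda \ge cn\|S\|_\infty$, and there is a constant $U > 5\bar c$ such that the
empirical Gram matrix satisfies $|\mathbb{E}_n[x_{ij}x_{ik}]| \le 1/[Us]$ for all $1 \le j < k \le p$, then

$$\zeta \le \frac{\lambda}{n} \cdot \frac{U + \bar c}{U - 5\bar c} + \min\left\{\frac{\sigma}{\sqrt{n}}, c_s\right\} + \frac{6\bar c}{U - 5\bar c}\,\frac{c_s}{\sqrt{s}} + \frac{4\bar c}{U}\,\frac{n}{\lambda}\,\frac{c_s^2}{s}.$$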

Thus, we see from parts (1) and (2) that perfect model selection is possible under
strong assumptions on the coefficients' separation away from zero. We also see from
part (3) that the strong separation of coefficients can be considerably weakened in
exchange for a strong assumption on the maximal pairwise correlation of regressors.
These results generalize to the nonparametric case the results of [17] and [18] for
the parametric case in which $c_s = 0$.
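
Part (1) of the lemma translates directly into a one-line selection rule. The sketch below is illustrative only: the function name is hypothetical, and in practice $\zeta$ is unknown, so the threshold $t$ would have to be chosen from a bound such as those in parts (2) and (3).

```python
import numpy as np

def hard_threshold_support(beta_hat, t):
    """Indices j with |beta_hat_j| > t, i.e. the thresholded LASSO model."""
    return np.flatnonzero(np.abs(beta_hat) > t)

# Toy check of the separation condition min_j |beta_0j| > zeta + t with t >= zeta.
beta0 = np.array([1.0, 0.5, 0.0, 0.0])
beta_hat = np.array([0.9, 0.6, 0.05, -0.08])   # hypothetical LASSO output
zeta = np.max(np.abs(beta_hat - beta0))        # about 0.1 here
t = 0.2                                        # satisfies zeta <= t < 0.5 - zeta
selected = hard_threshold_support(beta_hat, t)
```

In this toy example the spurious coefficients fall below $t$ while the true ones stay above it, so `selected` recovers the oracle support.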

Finally, the following result on perfect model selection also requires strong
assumptions on separation of coefficients and the empirical Gram matrix. Recall that
for a scalar $v$, $\mathrm{sign}(v) = v/|v|$ if $|v| > 0$, and $0$ otherwise. If $v$ is a vector, we apply
the definition componentwise. Also, given a vector $x \in \mathbb{R}^p$ and a set $T \subset \{1,\dots,p\}$,
let us denote $x_i[T] := \{x_{ij}, j \in T\}$.

Lemma 4 (Cases with Perfect Model Selection by LASSO). Suppose condition
ASM holds. We have perfect model selection for LASSO, $\hat T = T$, if and only if

$$\left\| \mathbb{E}_n[x_i[T^c]x_i[T]']\,\mathbb{E}_n[x_i[T]x_i[T]']^{-1}\left\{\mathbb{E}_n[x_i[T]u_i] - \frac{\lambda}{2n}\mathrm{sign}(\beta_0[T])\right\} - \mathbb{E}_n[x_i[T^c]u_i] \right\|_\infty \le \frac{\lambda}{2n},$$

$$\min_{j \in T} \left| \beta_{0j} + \left( \mathbb{E}_n[x_i[T]x_i[T]']^{-1}\left\{\mathbb{E}_n[x_i[T]u_i] - \frac{\lambda}{2n}\mathrm{sign}(\beta_0[T])\right\} \right)_j \right| > 0.$$



The result follows immediately from the first order optimality conditions, see 
ll25l . l27l and Q provides further primitive sufficient conditions for perfect model 
selection for the parametric case in which m, = £,-. The conditions above might typi- 
cally require a slightly larger choice of A than ( TPTI i and larger separation from zero 
of the minimal non-zero coefficient $\min_{j \in T} |\beta_{0j}|$.



5 Analysis of Post-LASSO

Next we study the rate of convergence of the Post-LASSO estimator. Recall that for
$\hat T = \mathrm{support}(\hat\beta)$, the Post-LASSO estimator solves

$$\tilde\beta \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) : \ \beta_j = 0 \ \text{for each } j \in \hat T^c.$$

It is clear that if the model selection works perfectly (as it will under some rather
stringent conditions discussed in Section 4.2), that is, $\hat T = T$, then this estimator is
simply the oracle least squares estimator whose properties are well known. However,
if the model selection does not work perfectly, that is, $\hat T \ne T$, the resulting
performance of the estimator faces two different perils: First, in the case where
LASSO selects a model $\hat T$ that does not fully include the true model $T$, we have
a specification error in the second step. Second, if LASSO includes additional
regressors outside $T$, these regressors were not chosen at random and are likely to be
spuriously correlated with the disturbances, so we have a data-snooping bias in the
second step.
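
The second step itself is just ordinary least squares restricted to the selected columns. A minimal sketch, assuming a LASSO estimate `beta_hat` is already available from any solver (the function name is hypothetical):

```python
import numpy as np

def post_lasso(X, y, beta_hat):
    """Refit by least squares on the support selected by a LASSO estimate.

    Coefficients outside support(beta_hat) stay exactly zero; those inside
    are re-estimated without shrinkage.
    """
    support = np.flatnonzero(beta_hat)
    beta_tilde = np.zeros(X.shape[1])
    if support.size:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_tilde[support] = coef
    return beta_tilde
```

Because the refit ignores the penalty, it removes the shrinkage bias of the first step, at the cost of the two perils described above when the selected support differs from $T$.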

It turns out that despite of the possible poor selection of the model, and the afore- 
mentioned perils this causes, the Post-LASSO estimator still performs well theo- 
retically, as shown in Q. Here we provide a proof similar to [6| which is easier 
generalize to non-Gaussian cases. 

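Theorem 3 (Non-Asymptotic Bound for Post-LASSO). Suppose condition ASM
holds. If $\lambda \ge cn\|S\|_\infty$ holds with probability at least $1 - \alpha$, then for any $\gamma > 0$ there
is a constant $K_\gamma$ independent of $n$ such that with probability at least $1 - \alpha - \gamma$

$$\|\tilde\beta - \beta_0\|_{2,n} \le \frac{K_\gamma\sigma}{\kappa(\hat m)}\sqrt{\frac{s + \hat m\log p}{n}} + 2c_s + 1\{T \not\subseteq \hat T\}\sqrt{\frac{\lambda\sqrt{s}}{n\kappa_{\bar c}}\left(\frac{(1 + c)\lambda\sqrt{s}}{cn\kappa_{\bar c}} + 2c_s\right)}.$$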

This theorem provides a performance bound for Post-LASSO as a function of 
LASSO's sparsity characterized by m, LASSO's rate of convergence, and LASSO's 
model selection ability. For common designs this bound implies that Post-LASSO 
performs at least as well as LASSO, but it can be strictly better in some cases, and 
has a smaller shrinkage bias by construction. 

Corollary 3 (Asymptotic Bound for Post-LASSO). Suppose conditions of
Corollary 1 hold. Then

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{\frac{s\log p}{n}} + c_s. \tag{34}$$

If further $\hat m = o(s)$ and $T \subseteq \hat T$ with probability approaching one, then

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\left[\sqrt{\frac{o(s)\log p}{n}} + \sqrt{\frac{s}{n}}\right] + c_s. \tag{35}$$

If $\hat T = T$ with probability approaching one, then Post-LASSO achieves the oracle
performance

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{s/n} + c_s. \tag{36}$$



It is also worth repeating here that finite-sample and asymptotic bounds in other
norms of interest immediately follow by the triangle inequality and by definition of
$\kappa(\hat m)$:

$$\sqrt{\mathbb{E}_n[(x_i'\tilde\beta - f_i)^2]} \le \|\tilde\beta - \beta_0\|_{2,n} + c_s \quad \text{and} \quad \|\tilde\beta - \beta_0\| \le \|\tilde\beta - \beta_0\|_{2,n}/\kappa(\hat m). \tag{37}$$

The corollary above shows that Post-LASSO achieves the same near-oracle rate
as LASSO. Notably, this occurs despite the fact that LASSO may in general fail to
correctly select the oracle model $T$ as a subset, that is $T \not\subseteq \hat T$. The intuition for this
result is that any components of $T$ that LASSO misses cannot be very important.
This corollary also shows that in some special cases Post-LASSO strictly improves
upon LASSO's rate. Finally, note that Corollary 3 follows by observing that under
the stated conditions,

$$\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\left[\sqrt{\frac{\hat m\log p}{n}} + \sqrt{\frac{s}{n}} + 1\{T \not\subseteq \hat T\}\sqrt{\frac{s\log p}{n}}\right] + c_s. \tag{38}$$



6 Estimation of Noise Level 

Our specification of penalty levels (17) and (18) requires the practitioner to know the
noise level $\sigma$ of the disturbances or at least estimate it. The purpose of this section
is to propose the following method for estimating $\sigma$. First, we use a conservative
estimate $\hat\sigma^0 = \sqrt{\mathrm{Var}_n[y_i]} := \sqrt{\mathbb{E}_n[(y_i - \bar y)^2]}$, where $\bar y = \mathbb{E}_n[y_i]$, in place of $\sigma^2$ to
obtain the initial LASSO and Post-LASSO estimates, $\hat\beta$ and $\tilde\beta$. The estimate $\hat\sigma^0$ is
conservative since $\hat\sigma^0 = \sigma^0 + o_P(1)$ where $\sigma^0 = \sqrt{\mathrm{Var}[y_i]} \ge \sigma$, since $x_i$ contains a
constant by assumption. Second, we define the refined estimate $\hat\sigma$ as

$$\hat\sigma = \sqrt{\hat Q(\hat\beta)}$$

in the case of LASSO and

$$\hat\sigma = \sqrt{\frac{n}{n - \hat s}\,\hat Q(\tilde\beta)}$$

in the case of Post-LASSO. In the latter case we employ the standard degree-of-freedom
correction with $\hat s = \|\tilde\beta\|_0 = |\hat T|$, and in the former case we need no
additional corrections, since the LASSO estimate is already sufficiently regularized.
Third, we use the refined estimate $\hat\sigma^2$ to obtain the refined LASSO and Post-LASSO
estimates $\hat\beta$ and $\tilde\beta$. We can stop here or further iterate on the last two steps.
Thus, the algorithm for estimating $\sigma$ using LASSO is as follows:

Algorithm 1 (Estimation of $\sigma$ using LASSO iterations) Set $\hat\sigma^0 = \sqrt{\mathrm{Var}_n[y_i]}$ and
$k = 0$, and specify a small constant $\nu > 0$, the tolerance level, and a constant $I > 1$,
the upper bound on the number of iterations. (1) Compute the LASSO estimator $\hat\beta$
based on $\lambda = 2c\hat\sigma^k\Lambda(1 - \alpha|X)$. (2) Set

$$\hat\sigma^{k+1} = \sqrt{\hat Q(\hat\beta)}.$$

(3) If $|\hat\sigma^{k+1} - \hat\sigma^k| \le \nu$ or $k + 1 \ge I$, then stop and report $\hat\sigma = \hat\sigma^{k+1}$; otherwise set
$k \leftarrow k + 1$ and go to (1).

And the algorithm for estimating $\sigma$ using Post-LASSO is as follows:

Algorithm 2 (Estimation of $\sigma$ using Post-LASSO iterations) Set $\hat\sigma^0 = \sqrt{\mathrm{Var}_n[y_i]}$
and $k = 0$, and specify a small constant $\nu \ge 0$, the tolerance level, and a constant
$I > 1$, the upper bound on the number of iterations. (1) Compute the Post-LASSO
estimator $\tilde\beta$ based on $\lambda = 2c\hat\sigma^k\Lambda(1 - \alpha|X)$. (2) Set

$$\hat\sigma^{k+1} = \sqrt{\frac{n}{n - \hat s}\,\hat Q(\tilde\beta)},$$

where $\hat s = \|\tilde\beta\|_0 = |\hat T|$. (3) If $|\hat\sigma^{k+1} - \hat\sigma^k| \le \nu$ or $k + 1 \ge I$, then stop and report
$\hat\sigma = \hat\sigma^{k+1}$; otherwise, set $k \leftarrow k + 1$ and go to (1).

We can also use $\lambda = 2c\hat\sigma^k\sqrt{n}\,\Phi^{-1}(1 - \alpha/2p)$ in place of the X-dependent penalty. We
note that using LASSO to estimate $\sigma$ it follows that the sequence $\hat\sigma^k$, $k \ge 2$, is
monotone, while using Post-LASSO the estimates $\hat\sigma^k$, $k \ge 1$, can only assume a
finite number of different values.
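
The two algorithms can be sketched in a few lines. The code below is only an illustration under stated assumptions: it uses the X-independent penalty $\lambda = 2c\hat\sigma^k\sqrt{n}\,\Phi^{-1}(1 - \alpha/2p)$ rather than $\Lambda(1 - \alpha|X)$, a plain cyclic coordinate-descent solver written for this example (not the authors' implementation), hypothetical function names, and default values `c=1.1`, `alpha=0.1` chosen for illustration. Columns of `X` are assumed to be normalized so that $\mathbb{E}_n[x_{ij}^2]$ is close to one.

```python
import numpy as np
from statistics import NormalDist

def lasso_cd(X, y, lam, n_sweeps=200, tol=1e-10):
    """Minimize E_n[(y_i - x_i'b)^2] + (lam/n)*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            # Soft-thresholding at lam/2 solves the one-dimensional problem.
            new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return beta

def estimate_sigma(X, y, c=1.1, alpha=0.1, post=True, nu=1e-8, max_iter=15):
    """Iterate penalty level and noise estimate (Algorithm 1 if post=False, 2 if True)."""
    n, p = X.shape
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))
    sigma = float(np.std(y))                        # conservative start sqrt(Var_n[y_i])
    beta = np.zeros(p)
    for _ in range(max_iter):
        lam = 2.0 * c * sigma * np.sqrt(n) * z      # X-independent penalty (assumption)
        beta = lasso_cd(X, y, lam)
        support = np.flatnonzero(beta)
        dof = n
        if post:
            if support.size:
                coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
                beta = np.zeros(p)
                beta[support] = coef
            dof = max(n - support.size, 1)          # degree-of-freedom correction
        new_sigma = float(np.sqrt(n * np.mean((y - X @ beta) ** 2) / dof))
        converged = abs(new_sigma - sigma) <= nu
        sigma = new_sigma
        if converged:
            break
    return sigma, beta
```

With `post=True` the iterates can take only finitely many values, so the loop typically stops as soon as the selected support stabilizes.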

The following theorem shows that these algorithms produce consistent estimates
of the noise level, and that the LASSO and Post-LASSO estimators based on the
resulting data-driven penalty continue to obey the asymptotic bounds we have derived
previously.

Theorem 4 (Validity of Results with Estimated $\sigma$). Suppose conditions ASM and
RSE hold. Suppose that $\sigma \le \hat\sigma^0 \lesssim \sigma$ with probability approaching 1 and $s\log p/n \to 0$.
Then $\hat\sigma$ produced by either Algorithm 1 or 2 is consistent,

$$\hat\sigma/\sigma \to_P 1,$$

so that the penalty levels $\lambda = 2c\hat\sigma^k\Lambda(1 - \alpha|X)$ and $\lambda = 2c\hat\sigma^k\sqrt{n}\,\Phi^{-1}(1 - \alpha/2p)$
with $\alpha = o(1)$, and $\log(1/\alpha) \lesssim \log p$, satisfy the condition (30) of Corollary 1,
namely

$$\lambda \lesssim_P \sigma\sqrt{n\log p} \quad \text{and} \quad \lambda \ge c'n\|S\|_\infty \ \text{wp} \to 1, \tag{39}$$

for some $1 < c' < c$. Consequently, the LASSO and Post-LASSO estimators based
on this penalty level obey the conclusions of Corollaries 1, 2, and 3.



7 Monte Carlo Experiments

In this section we compare the performance of LASSO, Post-LASSO, and the
ideal oracle linear regression estimators. The oracle estimator applies ordinary least
squares to the true model. (Such an estimator is not available outside Monte Carlo
experiments.)

We begin by considering the following regression model:

$$y = x'\beta_0 + \varepsilon, \quad \beta_0 = (1, 1, 1/2, 1/3, 1/4, 1/5, 0, \dots, 0)',$$

where $x = (1, z')'$ consists of an intercept and covariates $z \sim N(0, \Sigma)$, and the errors
$\varepsilon$ are independently and identically distributed $\varepsilon \sim N(0, \sigma^2)$. The dimension $p$ of
the covariates $x$ is 500, the dimension $s$ of the true model is 6, and the sample
size $n$ is 100. We set $\lambda$ according to the X-dependent rule with $1 - \alpha = 90\%$. The
regressors are correlated with $\Sigma_{ij} = \rho^{|i-j|}$ and $\rho = 0.5$. We consider two levels of
noise: Design 1 with $\sigma^2 = 1$ (higher level) and Design 2 with $\sigma^2 = 0.1$ (lower level).
For each repetition we draw new vectors $x_i$'s and errors $\varepsilon_i$'s.
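
One draw from this design can be generated as follows. This is a minimal sketch: the function name, seed handling, and the use of a Cholesky factor to impose the Toeplitz covariance are illustrative choices, not a description of the code used for the reported experiments.

```python
import numpy as np

def simulate_design(n=100, p=500, rho=0.5, sigma2=1.0, seed=0):
    """Draw (X, y, beta0) with x = (1, z')', z ~ N(0, Sigma), Sigma_ij = rho^|i-j|."""
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[:6] = [1.0, 1.0, 1 / 2, 1 / 3, 1 / 4, 1 / 5]
    idx = np.arange(p - 1)                      # p - 1 covariates plus an intercept
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(Sigma)
    z = rng.standard_normal((n, p - 1)) @ chol.T
    X = np.column_stack([np.ones(n), z])
    eps = np.sqrt(sigma2) * rng.standard_normal(n)
    y = X @ beta0 + eps
    return X, y, beta0
```

Calling `simulate_design(sigma2=1.0)` corresponds to Design 1 and `simulate_design(sigma2=0.1)` to Design 2; varying `seed` gives independent repetitions.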

We summarize the model selection performance of LASSO in Figures 3 and 4.
In the left panels of the figures, we plot the frequencies of the dimensions of the
selected model; in the right panels we plot the frequencies of selecting the correct
regressors. From the left panels we see that the frequency of selecting a much
larger model than the true model is very small in both designs. In the design with
a larger noise, as the right panel of Figure 3 shows, LASSO frequently fails to
select the entire true model, missing the regressors with small coefficients. However,
it almost always includes the most important three regressors with the largest
coefficients. Notably, despite this partial failure of the model selection, Post-LASSO
still performs well, as we report below. On the other hand, we see from the right
panel of Figure 4 that in the design with a lower noise level LASSO rarely misses
any component of the true support. These results confirm the theoretical results that
when the non-zero coefficients are well-separated from zero, the penalized estimator
should select a model that includes the true model as a subset. Moreover, these
results also confirm the theoretical result of Theorem 2, namely, that the dimension
of the selected model should be of the same stochastic order as the dimension of the
true model. In summary, the model selection performance of the penalized estimator
agrees very well with the theoretical results.

We summarize the results on the performance of estimators in Table 3, which
records for each estimator $\check\beta$ the mean $\ell_0$-norm $E[\|\check\beta\|_0]$, the norm of the bias
$\|E[\check\beta] - \beta_0\|$, and also the prediction error $E[\mathbb{E}_n[|x_i'(\check\beta - \beta_0)|^2]^{1/2}]$ for recovering the
regression function. As expected, LASSO has a substantial bias. We see that
Post-LASSO drastically improves upon LASSO, particularly in terms of reducing the
bias, which also results in a much lower overall prediction error. Notably, despite
the fact that under the higher noise level LASSO frequently fails to recover the true model,
the Post-LASSO estimator still performs well. This is because the penalized
estimator always manages to select the most important regressors. We also see that the
prediction error of the Post-LASSO is within a factor $\sqrt{\log p}$ of the prediction error
of the oracle estimator, as we would expect from our theoretical results. Under the
lower noise level, Post-LASSO performs almost identically to the ideal oracle
estimator. We would expect this since in this case LASSO selects the model especially
well, making Post-LASSO nearly the oracle.

Monte Carlo Results

Design 1 ($\sigma^2 = 1$)
              Mean $\ell_0$-norm    Bias      Prediction Error
LASSO         5.41                  0.4136    0.6572
Post-LASSO    5.41                  0.0998    0.3298
Oracle        6.00                  0.0122    0.2326

Design 2 ($\sigma^2 = 0.1$)
              Mean $\ell_0$-norm    Bias      Prediction Error
LASSO         6.3640                0.1395    0.2183
Post-LASSO    6.3640                0.0068    0.0893
Oracle        6.00                  0.0039    0.0736

Table 3 The table displays the average $\ell_0$-norm of the estimators as well as mean bias and
prediction error. We obtained the results using 1000 Monte Carlo repetitions for each design.

The results above used the true value of $\sigma$ in the choice of $\lambda$. Next we illustrate
how $\sigma$ can be estimated in practice. We follow the iterative procedure described in
the previous section. In our experiments the tolerance was $10^{-8}$ times the current
estimate for $\sigma$, which is typically achieved in less than 15 iterations.

We assess the performance of the iterative procedure under the design with the
larger noise, $\sigma^2 = 1$ (similar results hold for $\sigma^2 = 0.1$). The histograms in
Figure 5 show that the model selection properties are very similar to the model
selection when $\sigma$ is known. Figure 6 displays the distribution of the estimator $\hat\sigma$ of
$\sigma$ based on (iterative) Post-LASSO, (iterative) LASSO, and the initial estimator
$\hat\sigma^0 = \sqrt{\mathrm{Var}_n[y_i]}$. As we expected, the estimator $\hat\sigma$ based on LASSO produces estimates
that are somewhat higher than the true value. In contrast, the estimator $\hat\sigma$ based on
Post-LASSO seems to perform very well in our experiments, giving estimates $\hat\sigma$
that bunch closely near the true value $\sigma$.



8 Application to Cross-Country Growth Regression

In this section we apply LASSO and Post-LASSO to an international economic
growth example. We use the Barro and Lee [4] data consisting of a panel of 138
countries for the period of 1960 to 1985. We consider the national growth rates in
GDP per capita as a dependent variable $y$ for the periods 1965-75 and 1975-85.² In
our analysis, we will consider a model with $p = 62$ covariates, which allows for a
total of $n = 90$ complete observations. Our goal here is to select a subset of these
covariates and briefly compare the resulting models to the standard models used in
the empirical growth literature (Barro and Sala-i-Martin [5]).

² The growth rate in GDP over a period from $t_1$ to $t_2$ is commonly defined as $\log(\mathrm{GDP}_{t_2}/\mathrm{GDP}_{t_1})$.

Let us now turn to our empirical results. We performed covariate selection using
LASSO, where we used our data-driven choice of penalty level $\lambda$ in two ways.
First we used an upper bound on $\sigma$ being $\hat\sigma^0$ and decreased the penalty to estimate
different models with $\lambda$, $\lambda/2$, $\lambda/3$, $\lambda/4$, and $\lambda/5$. Second, we applied the iterative
procedure described in the previous section to define $\lambda''$ (which is computed based
on $\hat\sigma''$ obtained using the iterative Post-LASSO procedure).

The initial choice of the first approach led us to select no covariates, which is
consistent with over-regularization since an upper bound for $\sigma$ was used. We then
proceeded to slowly decrease the penalty level in order to allow for some covariates
to be selected. We present the model selection results in Table 5. With the first
relaxation of the choice of $\lambda$, we select the black market exchange rate premium
(characterizing trade openness) and a measure of political instability. With a second
relaxation of the choice of $\lambda$ we select an additional set of variables reported in the
table. The iterative approach led to a model with only the black market exchange
premium. We refer the reader to jfO and ID for a complete definition and discussion 
of each of these variables. 

We then proceeded to apply ordinary linear regression to the selected models and
we also report the standard confidence intervals for these estimates. Table 4 shows
these results. We find that in all models with additional selected covariates, the
linear regression coefficient on the initial level of GDP is always negative and the
standard confidence intervals do not include zero. We believe that these empirical
findings firmly support the hypothesis of (conditional) convergence derived from
the classical Solow-Swan-Ramsey growth model.³ Finally, our findings also agree
with and thus support the previous findings reported in Barro and Sala-i-Martin [5],
which relied on ad-hoc reasoning for covariate selection.

³ The inferential method used here is actually valid under certain conditions, despite the fact that
the model has been selected; this is demonstrated in a work in progress.

Confidence Intervals after Model Selection
for the International Growth Regressions

Penalization Parameter     Real GDP per capita (log)
($\lambda = 2.7870$)       Coefficient    90% Confidence Interval
$\lambda'' = 2.3662$       -0.0112        [-0.0219, -0.0007]
$\lambda/2$                -0.0120        [-0.0225, -0.0015]
$\lambda/3$                -0.0153        [-0.0261, -0.0045]
$\lambda/4$                -0.0221        [-0.0346, -0.0097]
$\lambda/5$                -0.0370        [-0.0556, -0.0184]

Table 4 The table above displays the coefficient and a 90% confidence interval associated with
each model selected by the corresponding penalty level. The selected models are displayed in
Table 5.



Model Selection Results for the International Growth Regressions

Penalization Parameter     Real GDP per capita (log) is included in all models
($\lambda = 2.7870$)       Additional Selected Variables

$\lambda$                  (none)

$\lambda''$                Black Market Premium (log)

$\lambda/2$                Black Market Premium (log)
                           Political Instability

$\lambda/3$                Black Market Premium (log)
                           Political Instability
                           Ratio of nominal government expenditure on defense to nominal GDP
                           Ratio of import to GDP

$\lambda/4$                Black Market Premium (log)
                           Political Instability
                           Ratio of nominal government expenditure on defense to nominal GDP

$\lambda/5$                Black Market Premium (log)
                           Political Instability
                           Ratio of nominal government expenditure on defense to nominal GDP
                           Ratio of import to GDP
                           Exchange rate
                           % of "secondary school complete" in male population
                           Terms of trade shock
                           Measure of tariff restriction
                           Infant mortality rate
                           Ratio of real government "consumption" net of defense and education
                           Female gross enrollment ratio for higher education

Table 5 The models selected at various levels of penalty.



Appendix

9 Proofs

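Proof (Theorem 1). Proceeding similarly to [8], by optimality of $\hat\beta$ we have that

$$\hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\hat\beta\|_1. \tag{40}$$

To prove the result we make use of the following relations: for $\delta = \hat\beta - \beta_0$, if
$\lambda \ge cn\|S\|_\infty$,

$$\hat Q(\hat\beta) - \hat Q(\beta_0) - \|\delta\|_{2,n}^2 = -2\mathbb{E}_n[\varepsilon_i x_i'\delta] - 2\mathbb{E}_n[r_i x_i'\delta] \tag{41}$$

$$\ge -\|S\|_\infty\|\delta\|_1 - 2c_s\|\delta\|_{2,n}$$

$$\ge -\frac{\lambda}{cn}(\|\delta_T\|_1 + \|\delta_{T^c}\|_1) - 2c_s\|\delta\|_{2,n}, \tag{42}$$

$$\|\beta_0\|_1 - \|\hat\beta\|_1 = \|\beta_{0T}\|_1 - \|\hat\beta_T\|_1 - \|\hat\beta_{T^c}\|_1 \le \|\delta_T\|_1 - \|\delta_{T^c}\|_1. \tag{43}$$

Thus, combining (40) with (41)-(43) implies that

$$-\frac{\lambda}{cn}(\|\delta_T\|_1 + \|\delta_{T^c}\|_1) + \|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \le \frac{\lambda}{n}(\|\delta_T\|_1 - \|\delta_{T^c}\|_1). \tag{44}$$

If $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} < 0$, then we have established the bound in the statement of
the theorem. On the other hand, if $\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \ge 0$ we get

$$\|\delta_{T^c}\|_1 \le \frac{c + 1}{c - 1}\|\delta_T\|_1 = \bar c\|\delta_T\|_1, \tag{45}$$

and therefore $\delta$ satisfies the condition to invoke RE($\bar c$). From (44) and using RE($\bar c$),
$\|\delta_T\|_1 \le \sqrt{s}\|\delta\|_{2,n}/\kappa_{\bar c}$, we get

$$\|\delta\|_{2,n}^2 - 2c_s\|\delta\|_{2,n} \le \left(1 + \frac{1}{c}\right)\frac{\lambda}{n}\|\delta_T\|_1 \le \left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n}\frac{\|\delta\|_{2,n}}{\kappa_{\bar c}},$$

which gives the result on the prediction norm.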

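Lemma 5 (Empirical pre-sparsity for LASSO). In either the parametric model or
the nonparametric model, let $\hat m = |\hat T \setminus T|$ and $\lambda \ge c \cdot n\|S\|_\infty$. We have

$$\sqrt{\hat m} \le \sqrt{s}\sqrt{\phi(\hat m)}\,2\bar c/\kappa_{\bar c} + 3(\bar c + 1)\sqrt{\phi(\hat m)}\,nc_s/\lambda,$$

where $c_s = 0$ in the parametric model.

Proof. We have from the optimality conditions that

$$2\mathbb{E}_n[x_{ij}(y_i - x_i'\hat\beta)] = \mathrm{sign}(\hat\beta_j)\lambda/n \quad \text{for each } j \in \hat T \setminus T.$$

Therefore we have for $R = (r_1, \dots, r_n)'$, $X = [x_1, \dots, x_n]'$, and $Y = (y_1, \dots, y_n)'$

$$\sqrt{\hat m}\,\lambda = 2\|(X'(Y - X\hat\beta))_{\hat T \setminus T}\|$$

$$\le 2\|(X'(Y - R - X\beta_0))_{\hat T \setminus T}\| + 2\|(X'R)_{\hat T \setminus T}\| + 2\|(X'X(\beta_0 - \hat\beta))_{\hat T \setminus T}\|$$

$$\le \sqrt{\hat m} \cdot n\|S\|_\infty + 2n\sqrt{\phi(\hat m)}\,c_s + 2n\sqrt{\phi(\hat m)}\,\|\hat\beta - \beta_0\|_{2,n},$$

where we used that

$$\|(X'X(\beta_0 - \hat\beta))_{\hat T \setminus T}\| \le \sup_{\|v_{T^c}\|_0 \le \hat m,\ \|v\| \le 1} |v'X'X(\beta_0 - \hat\beta)|$$

$$\le \sup_{\|v_{T^c}\|_0 \le \hat m,\ \|v\| \le 1} \|v'X'\|\,\|X(\beta_0 - \hat\beta)\|$$

$$= \sup_{\|v_{T^c}\|_0 \le \hat m,\ \|v\| \le 1} \sqrt{|v'X'Xv|}\,\|X(\beta_0 - \hat\beta)\|$$

$$= n\sqrt{\phi(\hat m)}\,\|\beta_0 - \hat\beta\|_{2,n},$$

and similarly $\|(X'R)_{\hat T \setminus T}\| \le n\sqrt{\phi(\hat m)}\,c_s$.

Since $\lambda/c \ge n\|S\|_\infty$, and by Theorem 1, $\|\beta_0 - \hat\beta\|_{2,n} \le (1 + \frac{1}{c})\frac{\lambda\sqrt{s}}{n\kappa_{\bar c}} + 2c_s$, we
have

$$(1 - 1/c)\sqrt{\hat m} \le 2\sqrt{\phi(\hat m)}(1 + 1/c)\sqrt{s}/\kappa_{\bar c} + 6\sqrt{\phi(\hat m)}\,nc_s/\lambda.$$

The result follows by noting that $(1 - 1/c) = 2/(\bar c + 1)$ by definition of $\bar c$.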
Proof (Theorem 2). Since $\lambda \ge c \cdot n\|S\|_\infty$, by Lemma 5 we have

$$\sqrt{\hat m} \le \sqrt{\phi(\hat m)} \cdot 2\bar c\sqrt{s}/\kappa_{\bar c} + 3(\bar c + 1)\sqrt{\phi(\hat m)} \cdot nc_s/\lambda,$$

which, by letting $L = \left(\frac{2\bar c}{\kappa_{\bar c}} + 3(\bar c + 1)\frac{nc_s}{\lambda\sqrt{s}}\right)^2$, can be rewritten as

$$\hat m \le s \cdot \phi(\hat m)L. \tag{46}$$

Note that $\hat m \le n$ by optimality conditions. Consider any $M \in \mathcal{M}$, and suppose $\hat m > M$.
Therefore by Lemma 9 on sublinearity of sparse eigenvalues

$$\hat m \le s \cdot \left\lceil\frac{\hat m}{M}\right\rceil\phi(M)L.$$

Thus, since $\lceil k \rceil < 2k$ for any $k \ge 1$ we have

$$M < s \cdot 2\phi(M)L,$$

which violates the condition of $M \in \mathcal{M}$ and $s$. Therefore, we must have $\hat m \le M$.
In turn, applying (46) once more with $\hat m \le (M \wedge n)$ we obtain

$$\hat m \le s \cdot \phi(M \wedge n)L.$$

The result follows by minimizing the bound over $M \in \mathcal{M}$.

Proof (Lemma 3, part (1)). The result follows immediately from the assumptions.

Proof (Lemma 3, part (2)). Let $\hat m = |\hat T \setminus T| = \|\hat\beta_{T^c}\|_0$. Then, note that $\|\delta\|_\infty \le \|\delta\| \le
\|\delta\|_{2,n}/\kappa(\hat m)$. The result follows from Theorem 1.

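Proof (Lemma 3, part (3)). Let $\delta := \hat\beta - \beta_0$. Note that by the first order optimality
conditions of $\hat\beta$ and the assumption on $\lambda$

$$\|\mathbb{E}_n[x_i x_i'\delta]\|_\infty \le \|\mathbb{E}_n[x_i(y_i - x_i'\hat\beta)]\|_\infty + \|S/2\|_\infty + \|\mathbb{E}_n[x_i r_i]\|_\infty \le \frac{\lambda}{2n} + \frac{\lambda}{2cn} + \min\left\{\frac{\sigma}{\sqrt{n}}, c_s\right\}$$

since $\|\mathbb{E}_n[x_i r_i]\|_\infty \le \min\left\{\frac{\sigma}{\sqrt{n}}, c_s\right\}$ by Lemma 6 below.

Next let $e_j$ denote the $j$th-canonical direction. Thus, for every $j = 1, \dots, p$ we
have

$$|\mathbb{E}_n[e_j'x_i x_i'\delta] - \delta_j| = |\mathbb{E}_n[e_j'(x_i x_i' - I)\delta]| \le \max_{1 \le j,k \le p} |(\mathbb{E}_n[x_i x_i' - I])_{jk}|\,\|\delta\|_1 \le \|\delta\|_1/[Us].$$

Then, combining the two bounds above and using the triangle inequality we have

$$\|\delta\|_\infty \le \|\mathbb{E}_n[x_i x_i'\delta]\|_\infty + \|\mathbb{E}_n[x_i x_i'\delta] - \delta\|_\infty \le \left(1 + \frac{1}{c}\right)\frac{\lambda}{2n} + \min\left\{\frac{\sigma}{\sqrt{n}}, c_s\right\} + \frac{\|\delta\|_1}{Us}.$$

The result follows by Lemma 7 to bound $\|\delta\|_1$ and the arguments in [8] and [17] to
show that the bound on the correlations implies that for any $C > 0$

$$\kappa_C \ge \sqrt{1 - s(1 + 2C)\|\mathbb{E}_n[x_i x_i' - I]\|_\infty},$$

so that $\kappa_{\bar c} \ge \sqrt{1 - [(1 + 2\bar c)/U]}$ and $\kappa_{2\bar c} \ge \sqrt{1 - [(1 + 4\bar c)/U]}$ under this particular
design.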

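Lemma 6. Under condition ASM, we have that

$$\|\mathbb{E}_n[x_i r_i]\|_\infty \le \min\left\{\frac{\sigma}{\sqrt{n}}, c_s\right\}.$$

Proof. First note that for every $j = 1, \dots, p$, we have $|\mathbb{E}_n[x_{ij}r_i]| \le \sqrt{\mathbb{E}_n[x_{ij}^2]\mathbb{E}_n[r_i^2]} = c_s$.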
Next, by definition of /3o in (fTTT ), for € I we have 

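$$\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)] = \mathbb{E}_n[x_{ij}r_i] = 0$$

since $\beta_0$ is a minimizer over the support of $\beta_0$. For $j \in T^c$ we have that for any $t \in \mathbb{R}$

$$\mathbb{E}_n[(f_i - x_i'\beta_0)^2] + \sigma^2\frac{s}{n} \le \mathbb{E}_n[(f_i - x_i'\beta_0 - tx_{ij})^2] + \sigma^2\frac{s + 1}{n}.$$

Therefore, for any $t \in \mathbb{R}$ we have

$$-\sigma^2/n \le \mathbb{E}_n[(f_i - x_i'\beta_0 - tx_{ij})^2] - \mathbb{E}_n[(f_i - x_i'\beta_0)^2] = -2t\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)] + t^2\mathbb{E}_n[x_{ij}^2].$$

Taking the minimum over $t$ in the right hand side at $t^* = \mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)]$ we obtain

$$-\sigma^2/n \le -(\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)])^2,$$

or equivalently, $|\mathbb{E}_n[x_{ij}(f_i - x_i'\beta_0)]| \le \sigma/\sqrt{n}$.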
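Lemma 7. If $\lambda \ge cn\|S\|_\infty$, then for $\bar c = (c + 1)/(c - 1)$ we have

$$\|\hat\beta - \beta_0\|_1 \le \frac{(1 + 2\bar c)\sqrt{s}}{\kappa_{2\bar c}}\left[\left(1 + \frac{1}{c}\right)\frac{\lambda\sqrt{s}}{n\kappa_{\bar c}} + 2c_s\right] + \left(1 + \frac{1}{2\bar c}\right)\frac{2c}{c - 1}\frac{n}{\lambda}c_s^2,$$

where $c_s = 0$ in the parametric case.

Proof. First, assume $\|\delta_{T^c}\|_1 \le 2\bar c\|\delta_T\|_1$. In this case, by definition of the restricted
eigenvalue, we have

$$\|\delta\|_1 \le (1 + 2\bar c)\|\delta_T\|_1 \le (1 + 2\bar c)\sqrt{s}\|\delta\|_{2,n}/\kappa_{2\bar c}$$

and the result follows by applying the first bound to $\|\delta\|_{2,n}$ since $\bar c > 1$.

On the other hand, consider the case that $\|\delta_{T^c}\|_1 > 2\bar c\|\delta_T\|_1$, which would already
imply $\|\delta\|_{2,n} \le 2c_s$. Moreover, the relation (44) implies that

$$\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1 + \frac{c}{c - 1}\frac{n}{\lambda}\|\delta\|_{2,n}(2c_s - \|\delta\|_{2,n}) \le \bar c\|\delta_T\|_1 + \frac{c}{c - 1}\frac{n}{\lambda}c_s^2 \le \frac{1}{2}\|\delta_{T^c}\|_1 + \frac{c}{c - 1}\frac{n}{\lambda}c_s^2.$$

Thus,

$$\|\delta\|_1 \le \left(1 + \frac{1}{2\bar c}\right)\|\delta_{T^c}\|_1 \le \left(1 + \frac{1}{2\bar c}\right)\frac{2c}{c - 1}\frac{n}{\lambda}c_s^2.$$

The result follows by adding the bounds on each case and invoking Theorem 1
to bound $\|\delta\|_{2,n}$.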

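Proof (Theorem on Post-LASSO). Let $\tilde\delta := \tilde\beta - \beta_0$. By definition of the Post-LASSO estimator, it follows that $\hat Q(\tilde\beta) \le \hat Q(\hat\beta)$ and $\hat Q(\tilde\beta) \le \hat Q(\beta_{0\hat T})$. Thus,

$$\hat Q(\tilde\beta) - \hat Q(\beta_0) \le (\hat Q(\hat\beta) - \hat Q(\beta_0)) \wedge (\hat Q(\beta_{0\hat T}) - \hat Q(\beta_0)) =: B_n \wedge C_n.$$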

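The least squares criterion function satisfies

$$\begin{aligned}
|\hat Q(\tilde\beta) - \hat Q(\beta_0) - \|\tilde\delta\|_{2,n}^2| &\le |S'\tilde\delta| + 2c_s\|\tilde\delta\|_{2,n} \\
&\le |S_T'\tilde\delta| + |S_{T^c}'\tilde\delta| + 2c_s\|\tilde\delta\|_{2,n} \\
&\le \|S_T\|\,\|\tilde\delta\| + \|S_{T^c}\|_\infty\|\tilde\delta_{T^c}\|_1 + 2c_s\|\tilde\delta\|_{2,n} \\
&\le \|S_T\|\,\|\tilde\delta\| + \|S_{T^c}\|_\infty\sqrt{\hat m}\,\|\tilde\delta\| + 2c_s\|\tilde\delta\|_{2,n} \\
&\le \|S_T\|\frac{\|\tilde\delta\|_{2,n}}{\tilde\kappa(\hat m)} + \|S_{T^c}\|_\infty\sqrt{\hat m}\,\frac{\|\tilde\delta\|_{2,n}}{\tilde\kappa(\hat m)} + 2c_s\|\tilde\delta\|_{2,n}.
\end{aligned}$$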

Next, note that for any $j \in \{1,\ldots,p\}$ we have $E[S_j^2] = 4\sigma^2/n$, so that $E[\|S_T\|^2] \le 4\sigma^2 s/n$. Thus, by the Chebyshev inequality, for any $\gamma > 0$ there is a constant $A_\gamma$ such that $\|S_T\| \le A_\gamma\sigma\sqrt{s/n}$ with probability at least $1 - \gamma$. Moreover, using Lemma 8, $\|S_{T^c}\|_\infty \le A_\gamma' 2\sigma\sqrt{2\log p/n}$ with probability at least $1 - \gamma$ for some constant $A_\gamma'$.
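The moment identity $E[S_j^2] = 4\sigma^2/n$ and the Chebyshev step can be illustrated by simulation. The sketch below is not part of the original argument; it assumes $S_j = 2\mathbb{E}_n[x_{ij}\epsilon_i]$ with $\epsilon_i \sim N(0,\sigma^2)$ and a fixed regressor normalized so that $\mathbb{E}_n[x_{ij}^2] = 1$, and the values of `n`, `sigma`, `reps` and `A` are arbitrary illustrative choices.

```python
import math
import random

random.seed(1)
n, sigma, reps = 50, 1.5, 20000

# Hypothetical fixed regressor column, normalized so that E_n[x_j^2] = 1.
raw = [random.gauss(0.0, 1.0) for _ in range(n)]
scale = math.sqrt(sum(v * v for v in raw) / n)
x = [v / scale for v in raw]

def score_component():
    """One draw of S_j = 2 E_n[x_ij eps_i] with eps_i ~ N(0, sigma^2)."""
    return 2.0 * sum(xi * random.gauss(0.0, sigma) for xi in x) / n

draws = [score_component() for _ in range(reps)]
second_moment = sum(s * s for s in draws) / reps   # Monte Carlo estimate of E[S_j^2]
target = 4.0 * sigma ** 2 / n                      # theoretical value 4 sigma^2 / n

# Chebyshev: P(|S_j| >= A * sqrt(E[S_j^2])) <= 1 / A^2 for any A > 0.
A = 3.0
tail_freq = sum(abs(s) >= A * math.sqrt(target) for s in draws) / reps
```

The simulated second moment should be close to `target`, and the empirical tail frequency should lie below the Chebyshev bound `1 / A**2`.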

Define $A_{\gamma,n} := K_\gamma\sigma\sqrt{(s + \hat m\log p)/n}$, so that $A_{\gamma,n} \ge \|S_T\| + \sqrt{\hat m}\,\|S_{T^c}\|_\infty$ with probability at least $1 - \gamma$ for some constant $K_\gamma < \infty$ independent of $n$ and $p$.
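Combining these relations, with probability at least $1 - \gamma$ we have

$$\|\tilde\delta\|_{2,n}^2 - A_{\gamma,n}\|\tilde\delta\|_{2,n}/\tilde\kappa(\hat m) - 2c_s\|\tilde\delta\|_{2,n} \le B_n \wedge C_n,$$

solving which we obtain:

$$\|\tilde\delta\|_{2,n} \le A_{\gamma,n}/\tilde\kappa(\hat m) + 2c_s + \sqrt{(B_n)_+ \wedge (C_n)_+}. \qquad (47)$$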
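One way to see how the quadratic inequality is solved is the following short derivation, added here as a sketch. The shorthand $x$, $a$, $b$ is introduced only for this illustration and does not appear in the original.

```latex
Let $x = \|\tilde\delta\|_{2,n}$, $a = A_{\gamma,n}/\tilde\kappa(\hat m) + 2c_s \ge 0$
and $b = B_n \wedge C_n$, so that the inequality reads $x^2 - a x \le b$.
If $x \le a$, the bound $x \le a + \sqrt{b_+}$ holds trivially.
If $x > a$, then $0 < x - a \le x$ and therefore
\[
  (x - a)^2 \le x (x - a) \le b,
  \qquad\text{hence}\qquad
  x \le a + \sqrt{b_+}.
\]
Since $(B_n \wedge C_n)_+ = (B_n)_+ \wedge (C_n)_+$, this is the bound (47).
```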
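Note that by the optimality of $\hat\beta$ in the LASSO problem, and letting $\hat\delta = \hat\beta - \beta_0$,

$$B_n = \hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}(\|\beta_0\|_1 - \|\hat\beta\|_1) \le \frac{\lambda}{n}(\|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1). \qquad (48)$$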

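If $\|\hat\delta_{T^c}\|_1 > \bar c\|\hat\delta_T\|_1$, we have $\hat Q(\hat\beta) - \hat Q(\beta_0) \le 0$ since $\bar c > 1$. Otherwise, if $\|\hat\delta_{T^c}\|_1 \le \bar c\|\hat\delta_T\|_1$, by RE($\bar c$) we have

$$B_n := \hat Q(\hat\beta) - \hat Q(\beta_0) \le \frac{\lambda}{n}\|\hat\delta_T\|_1 \le \frac{\lambda}{n}\,\frac{\sqrt{s}\,\|\hat\delta\|_{2,n}}{\kappa_{\bar c}}. \qquad (49)$$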

The choice of $\lambda$ yields $\lambda \ge cn\|S\|_\infty$ with probability $1 - \alpha$. Thus, by applying Theorem 1, which requires $\lambda \ge cn\|S\|_\infty$, we can bound $\|\hat\delta\|_{2,n}$.

Finally, with probability $1 - \alpha - \gamma$ we have that (47) and (49) hold with $\|\hat\delta\|_{2,n} \le (1 + 1/c)\lambda\sqrt{s}/(n\kappa_{\bar c}) + 2c_s$, and the result follows since if $T \subseteq \hat T$ we have $C_n = 0$, so that $B_n \wedge C_n \le 1\{T \not\subseteq \hat T\}B_n$.

Proof (Theorem 0. Consider the case of Post-LASSO; the proof for LASSO is 
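similar. Consider the case with $k = 1$, i.e. when $\hat\sigma = \hat\sigma^k$ for $k = 1$. Then we have

$$\Big|\frac{\hat Q(\tilde\beta)}{\sigma^2} - \frac{\mathbb{E}_n[\epsilon_i^2]}{\sigma^2}\Big| \le \frac{\|\tilde\beta - \beta_0\|_{2,n}^2}{\sigma^2} + \frac{\|S\|_\infty\|\tilde\beta - \beta_0\|_1}{\sigma^2} + \frac{2c_s\|\tilde\beta - \beta_0\|_{2,n}}{\sigma^2} + \frac{2c_s\sqrt{\mathbb{E}_n[\epsilon_i^2]}}{\sigma^2} + \frac{c_s^2}{\sigma^2} = o_P(1),$$

since $\|\tilde\beta - \beta_0\|_{2,n} \lesssim_P \sigma\sqrt{(s/n)\log p}$ by Corollary 3 and by assumption on $\hat\sigma^0$, $\|S\|_\infty \lesssim_P \sigma\sqrt{(\log p)/n}$ by Lemma 8, $\|\tilde\beta - \beta_0\|_1 \le \sqrt{\hat s}\,\|\tilde\beta - \beta_0\|_2 \lesssim_P \sqrt{\hat s}\,\|\tilde\beta - \beta_0\|_{2,n}$ by condition RSE, $\hat s \lesssim_P s$ by Corollary 2, $c_s \lesssim \sigma\sqrt{s/n}$ by condition ASM, $s\log p/n \to 0$ by assumption, and $\mathbb{E}_n[\epsilon_i^2]/\sigma^2 - 1 \to_P 0$ by the Chebyshev inequality. Finally, $n/(n - \hat s) = 1 + o_P(1)$ since $\hat s \lesssim_P s$ by Corollary 2 and $s\log p/n \to 0$. The result for $2 \le k \le I - 1$ follows by induction.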



10 Auxiliary Lemmas 

Recall that $\|S/(2\sigma)\|_\infty = \max_{1 \le j \le p}|\mathbb{E}_n[x_{ij} g_i]|$, where the $g_i$ are i.i.d. $N(0,1)$, for $i = 1,\ldots,n$, conditional on $X = [x_1',\ldots,x_n']'$, and $\mathbb{E}_n[x_{ij}^2] = 1$ for each $j = 1,\ldots,p$, and note that $P(n\|S/(2\sigma)\|_\infty > \Lambda(1 - \alpha|X) \mid X) = \alpha$ by definition.

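Lemma 8. We have that for $t > 0$:

$$P(n\|S/(2\sigma)\|_\infty > t\sqrt{n} \mid X) \le 2p(1 - \Phi(t)) \le 2p\,\frac{1}{t}\,\phi(t),$$

$$\Lambda(1 - \alpha|X) \le \sqrt{n}\,\Phi^{-1}(1 - \alpha/2p) \le \sqrt{2n\log(2p/\alpha)},$$

$$P(n\|S/(2\sigma)\|_\infty \ge \sqrt{2n\log(2p/\alpha)} \mid X) \le \alpha.$$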

Proof. To establish the first claim, note that $\sqrt{n}\|S/(2\sigma)\|_\infty = \max_{1 \le j \le p}|Z_j|$, where the $Z_j = \sqrt{n}\,\mathbb{E}_n[x_{ij} g_i]$ are $N(0,1)$, since the $g_i$ are i.i.d. $N(0,1)$ conditional on $X$ and $\mathbb{E}_n[x_{ij}^2] = 1$ for each $j = 1,\ldots,p$. Then the first claim follows by observing that for $z > 0$, by the union bound, $P(\max_{1 \le j \le p}|Z_j| > z) \le pP(|Z_j| > z) = 2p(1 - \Phi(z))$, and by $(1 - \Phi(z)) = \int_z^\infty \phi(u)\,du \le \int_z^\infty (u/z)\phi(u)\,du \le (1/z)\phi(z)$. The second and third claims follow by noting that $2p(1 - \Phi(t')) = \alpha$ at $t' = \Phi^{-1}(1 - \alpha/2p)$, and $2p\frac{1}{t''}\phi(t'') = \alpha$ at $t'' \le \sqrt{2\log(2p/\alpha)}$, so that, in view of the first claim, $\Lambda(1 - \alpha|X) \le \sqrt{n}\,t' \le \sqrt{n}\,t''$.
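The Gaussian tail bound and the two upper bounds on the penalty quantile can be evaluated directly. The sketch below uses only the Python standard library; the function names are hypothetical, and the values $n = 100$, $p = 500$ match the Monte Carlo design described in the figure captions, while $\alpha = 0.05$ is an assumed illustrative choice.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution: pdf, cdf, inv_cdf

def upper_tail(t):
    """1 - Phi(t), computed via erfc to stay accurate for large t."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def mills_bound(t):
    """phi(t) / t, the upper bound on 1 - Phi(t) used in the proof (t > 0)."""
    return STD.pdf(t) / t

def penalty_quantile_bounds(n, p, alpha):
    """Return (sqrt(n) * Phi^{-1}(1 - alpha/(2p)), sqrt(2 n log(2p/alpha)))."""
    t_prime = STD.inv_cdf(1.0 - alpha / (2.0 * p))
    simple = math.sqrt(2.0 * n * math.log(2.0 * p / alpha))
    return math.sqrt(n) * t_prime, simple

# Both quantities bound Lambda(1 - alpha | X) from above; the first is tighter.
tight, simple = penalty_quantile_bounds(n=100, p=500, alpha=0.05)
```

The closed-form bound $\sqrt{2n\log(2p/\alpha)}$ is slightly more conservative than the quantile-based bound, which is the price of not having to invert $\Phi$.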



Lemma 9 (Sub-linearity of restricted sparse eigenvalues). For any integer $k \ge 0$ and constant $\ell \ge 1$ we have $\phi(\lceil \ell k \rceil) \le \lceil \ell \rceil\,\phi(k)$.

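Proof. Let $W := \mathbb{E}_n[x_i x_i']$ and let $\bar\alpha$ be such that $\phi(\lceil \ell k \rceil) = \bar\alpha' W \bar\alpha$, $\|\bar\alpha\| = 1$. We can decompose the vector $\bar\alpha$ so that

$$\bar\alpha = \sum_{i=1}^{\lceil\ell\rceil}\alpha_i, \quad\text{with}\quad \sum_{i=1}^{\lceil\ell\rceil}\|\alpha_{iT^c}\|_0 = \|\bar\alpha_{T^c}\|_0 \quad\text{and}\quad \alpha_{iT} = \bar\alpha_T/\lceil\ell\rceil,$$

where we can choose the $\alpha_i$'s such that $\|\alpha_{iT^c}\|_0 \le k$ for each $i = 1,\ldots,\lceil\ell\rceil$, since $\lceil\ell\rceil k \ge \lceil\ell k\rceil$. Note that the vectors $\alpha_i$ have no overlapping support outside $T$. Since $W$ is positive semi-definite, $\alpha_i' W \alpha_i + \alpha_j' W \alpha_j \ge 2|\alpha_i' W \alpha_j|$ for any pair $(i,j)$. Therefore

$$\begin{aligned}
\phi(\lceil\ell k\rceil) = \bar\alpha' W \bar\alpha &= \sum_{i=1}^{\lceil\ell\rceil}\sum_{j=1}^{\lceil\ell\rceil}\alpha_i' W \alpha_j \\
&\le \sum_{i=1}^{\lceil\ell\rceil}\sum_{j=1}^{\lceil\ell\rceil}\frac{\alpha_i' W \alpha_i + \alpha_j' W \alpha_j}{2} = \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\alpha_i' W \alpha_i \\
&\le \lceil\ell\rceil\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2\,\phi(\|\alpha_{iT^c}\|_0) \le \lceil\ell\rceil\max_{i=1,\ldots,\lceil\ell\rceil}\phi(\|\alpha_{iT^c}\|_0) \le \lceil\ell\rceil\,\phi(k),
\end{aligned}$$

where we used that

$$\sum_{i=1}^{\lceil\ell\rceil}\|\alpha_i\|^2 = \sum_{i=1}^{\lceil\ell\rceil}(\|\alpha_{iT}\|^2 + \|\alpha_{iT^c}\|^2) = \frac{\|\bar\alpha_T\|^2}{\lceil\ell\rceil} + \sum_{i=1}^{\lceil\ell\rceil}\|\alpha_{iT^c}\|^2 \le \|\bar\alpha\|^2 = 1.$$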
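The sub-linearity property can be checked by brute force on a small simulated design. The sketch below rests on assumptions not stated in this section: it takes $\phi(m)$ to be the largest eigenvalue of $W$ over supports with at most $m$ components outside $T$, and it sets $T = \emptyset$ for simplicity, so that $\phi(m)$ is the maximal $m$-sparse eigenvalue. All function names are hypothetical, and the eigenvalues are approximated by power iteration rather than computed exactly.

```python
import itertools
import math
import random

def gram(n, p, rng):
    """Simulated W = E_n[x_i x_i'] with columns rescaled so that W_jj = 1."""
    cols = []
    for _ in range(p):
        col = [rng.gauss(0.0, 1.0) for _ in range(n)]
        scale = math.sqrt(sum(v * v for v in col) / n)
        cols.append([v / scale for v in col])
    return [[sum(a * b for a, b in zip(cols[j], cols[k])) / n for k in range(p)]
            for j in range(p)]

def top_eigenvalue(mat, iters=2000):
    """Approximate largest eigenvalue of a symmetric PSD matrix by power iteration."""
    d = len(mat)
    v = [float(i + 1) for i in range(d)]  # asymmetric start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, w))  # Rayleigh quotient v' M v

def phi(W, m):
    """Maximal m-sparse eigenvalue of W (T taken to be empty), by enumeration."""
    best = 0.0
    for support in itertools.combinations(range(len(W)), m):
        sub = [[W[j][k] for k in support] for j in support]
        best = max(best, top_eigenvalue(sub))
    return best

rng = random.Random(2)
W = gram(n=30, p=6, rng=rng)
k, ell = 2, 1.5
lhs = phi(W, math.ceil(ell * k))   # phi(ceil(l k)), here phi(3)
rhs = math.ceil(ell) * phi(W, k)   # ceil(l) * phi(k), here 2 * phi(2)
```

Enumeration over supports is only feasible for very small $p$; the point of the lemma is precisely that such sparse eigenvalues can be controlled without enumeration.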

Lemma 10. Let $\bar c = (c+1)/(c-1)$. We have, for any integer $m > 0$,

$$\kappa_{\bar c} \ge \tilde\kappa(m)\Big(1 - \mu(m)\,\bar c\sqrt{s/m}\Big).$$

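Proof. We follow the proof in [8]. Pick an arbitrary vector $\delta$ such that $\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1$. Let $T^1$ denote the $m$ largest components of $\delta_{T^c}$. Moreover, let $T^c = \cup_{k=1}^K T^k$, where $K = \lceil (p - s)/m \rceil$, $|T^k| \le m$, and $T^k$ corresponds to the $m$ largest components of $\delta$ outside $T \cup (\cup_{d=1}^{k-1} T^d)$.
We have

$$\begin{aligned}
\|\delta\|_{2,n} &\ge \|\delta_{T \cup T^1}\|_{2,n} - \|\delta_{(T \cup T^1)^c}\|_{2,n} \ge \tilde\kappa(m)\|\delta_{T \cup T^1}\| - \sum_{k=2}^K\|\delta_{T^k}\|_{2,n} \\
&\ge \tilde\kappa(m)\|\delta_{T \cup T^1}\| - \sqrt{\phi(m)}\sum_{k=2}^K\|\delta_{T^k}\|.
\end{aligned}$$

Next note that

$$\|\delta_{T^{k+1}}\| \le \|\delta_{T^k}\|_1/\sqrt{m}.$$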

Indeed, consider the problem $\max\{\|v\|/\|u\|_1 : v, u \in \mathbb{R}^m, \max_i|v_i| \le \min_i|u_i|\}$. Given a $v$ and $u$, we can always increase the objective function by using $\tilde v = \max_i|v_i|(1,\ldots,1)'$ and $\tilde u = \min_i|u_i|(1,\ldots,1)'$ instead. Thus, the maximum is achieved at $v^* = u^* = (1,\ldots,1)'$, yielding $1/\sqrt{m}$.
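The block inequality above is easy to verify numerically. The following sketch sorts the entries of a hypothetical vector $\delta_{T^c}$ by decreasing magnitude, splits them into blocks of size $m$, and compares the $\ell_2$-norm of each block with the $\ell_1$-norm of the preceding one. The helper names and the simulated vector are illustrative assumptions only.

```python
import math
import random

def blocks_by_magnitude(values, m):
    """Split entries into consecutive blocks of size m after sorting by decreasing |value|."""
    ordered = sorted(values, key=abs, reverse=True)
    return [ordered[i:i + m] for i in range(0, len(ordered), m)]

def l1(v):
    """l1-norm of a list of numbers."""
    return sum(abs(x) for x in v)

def l2(v):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(x * x for x in v))

random.seed(3)
m = 4
delta_off_support = [random.gauss(0.0, 1.0) for _ in range(22)]  # hypothetical delta_{T^c}
blocks = blocks_by_magnitude(delta_off_support, m)

# Slack in ||delta_{T^{k+1}}|| <= ||delta_{T^k}||_1 / sqrt(m) for consecutive blocks.
gaps = [l1(blocks[k]) / math.sqrt(m) - l2(blocks[k + 1])
        for k in range(len(blocks) - 1)]

# Summing over k: sum_{k>=2} ||delta_{T^k}|| <= ||delta_{T^c}||_1 / sqrt(m).
tail_sum = sum(l2(b) for b in blocks[1:])
```

Every entry of `gaps` is nonnegative because each entry of a later block is no larger in magnitude than every entry of the block before it.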

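Thus, by $\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1$ and $|T| = s$,

$$\sum_{k=2}^K\|\delta_{T^k}\| \le \sum_{k=1}^{K-1}\frac{\|\delta_{T^k}\|_1}{\sqrt{m}} \le \frac{\|\delta_{T^c}\|_1}{\sqrt{m}} \le \bar c\sqrt{\frac{s}{m}}\,\|\delta_T\|.$$

Therefore, combining these relations with $\|\delta_{T \cup T^1}\| \ge \|\delta_T\| \ge \|\delta_T\|_1/\sqrt{s}$, we have

$$\|\delta\|_{2,n} \ge \tilde\kappa(m)\Big(1 - \mu(m)\,\bar c\sqrt{s/m}\Big)\frac{\|\delta_T\|_1}{\sqrt{s}},$$

where $\mu(m) = \sqrt{\phi(m)}/\tilde\kappa(m)$, which yields the result.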
Fig. 2 The figures illustrate the geometry of the LASSO and Post-LASSO estimators.
[Figure 3 panels: "Total Number of Components Selected" (left) and "Number of Correct Components Selected" (right).]

Fig. 3 The figure summarizes the covariate selection results for the design with $\sigma = 1$, based
on 1000 Monte Carlo repetitions. The left panel plots the histogram for the number of covariates
selected by LASSO out of the possible 500 covariates, $|\hat T|$. The right panel plots the histogram for
the number of significant covariates selected by LASSO, $|\hat T \cap T|$; there are in total 6 significant
covariates amongst 500 covariates. The sample size for each repetition was $n = 100$.



[Figure 4 panels: "Total Number of Components Selected" (left) and "Number of Correct Components Selected" (right).]

Fig. 4 The figure summarizes the covariate selection results for the design with $\sigma^2 = 0.1$, based
on 1000 Monte Carlo repetitions. The left panel plots the histogram for the number of covariates
selected out of the possible 500 covariates, $|\hat T|$. The right panel plots the histogram for the number
of significant covariates selected, $|\hat T \cap T|$; there are in total 6 significant covariates amongst 500
covariates. The sample size for each repetition was $n = 100$.
[Figure 5 panels: "Total Number of Components Selected" (left) and "Number of Correct Components Selected" (right).]

Fig. 5 The figure summarizes the covariate selection results for the design with $\sigma = 1$, when
$\sigma$ is estimated, based on 1000 Monte Carlo repetitions. The left panel plots the histogram for the
number of covariates selected out of the possible 500 covariates. The right panel plots the histogram
for the number of significant covariates selected; there are in total 6 significant covariates amongst
500 covariates. The sample size for each repetition was $n = 100$.
Fig. 6 The figure displays the distribution of the estimator $\hat\sigma$ of $\sigma$ based on (iterative) LASSO,
(iterative) Post-LASSO, and the conservative initial estimator $\hat\sigma^0 = \sqrt{\mathrm{Var}_n[y_i]}$. The plots summarize
the estimation performance for the design with $\sigma = 1$, based on 1000 Monte Carlo repetitions.



